Abstract

XML keyword search is a user-friendly way to query XML data using only keywords. In XML 
keyword search, to achieve high precision without sacrificing recall, it is important to remove spurious 
results not intended by the user. Efforts to eliminate spurious results have enjoyed some success 
by using the concepts of LCA or its variants, SLCA and MLCA. However, existing methods can still
return many spurious results. The fundamental cause for the occurrence of spurious results is
that the existing methods try to eliminate spurious results locally without global examination of 
all the query results and, accordingly, some spurious results are not consistently eliminated. In 
this paper, we propose a novel keyword search method that removes spurious results consistently 
by exploiting the new concept of structural consistency. We define structural consistency as a 
property that is preserved if there is no query result having an ancestor-descendant relationship at 
the schema level with any other query results. A naive solution to obtain structural consistency 
would be to compute all the LCAs (or variants) and then to remove spurious results according to 
structural consistency. Obviously, this approach would always be slower than existing LCA-based 
ones. To speed up structural consistency checking, we must be able to examine the query results 
at the schema level without generating all the LCAs. However, this is a challenging problem since 
the schema-level query results do not homomorphically map to the instance-level query results, 
causing serious false dismissal. We present a comprehensive and practical solution to this problem 
and formally prove that this solution preserves structural consistency at the schema level without 
incurring false dismissal. We also propose a relevance-feedback based solution for the problem where 
our method has low recall, which occurs when it is not the user's intention to find more specific 
results. This solution has been prototyped in a full-fledged object-relational DBMS. Experimental 
results using real and synthetic data sets show that, compared with the state-of-the-art methods, 
our solution significantly 1) improves precision while providing comparable recall for most queries 
and 2) enhances the query performance by removing spurious results early. 



1 Introduction 



As XML becomes the standard for data representation and exchange on the Internet, querying XML 
data has become an important issue [28]. Research work in this area can be classified into two categories: 
the structured query approach and the keyword query approach [28] . Both approaches have tradeoffs. 
The structured query approach specifies the precise structure of the desired results using a structured 
query language such as XPath and XQuery. However, it is hard to formulate queries without prior 
knowledge about structured query languages or without knowing the schema of the XML data. The 
keyword query, on the other hand, can overcome this problem by requiring only keywords rather than 
specific structure information. This approach, however, might not deliver precise results since it does 
not contain precise structures. 

In the structured query, the user's query intention can be expressed as either a single structured 
query or multiple structured queries, depending on the heterogeneity of the underlying XML data. If 
there is only one structure matching the user's intention at the schema level, that intention can be 
expressed in a single structured query. However, if there are multiple structures matching the user's 
intention, multiple structured queries for those structures must be composed. 

Example 1 The XML data in Fig. 1(a) represent bibliographic data on conference publications. Suppose
that a user intends to find the publications of "Levy" on "XML". This query can be stated as a
single structured query, Q1; in the keyword query, it is represented as "XML Levy". The query result is
{paper(6)}. Here, we denote the subtree rooted at node p as p in the same way as is done by Xu and
Papakonstantinou [46].

Q1: /bib/conf/paper["XML"]["Levy"]¹ □

1 For ease of exposition, we denote the predicate that checks whether a keyword w is contained in an element e as
e["w"] instead of e[contains(., "w")] that uses the contains function in the XPath standard.

Example 2 The XML data in Fig. 1(b) represent bibliographic data on conference and journal publications.
Here, the subtree rooted at conf(1) is the same as in Fig. 1(a). Since there are two structures
matching the user's intention, one for conference papers and the other for journal articles, a union of
multiple structured queries, Q2, must be used to find the desired results despite the same query intention
as in Example 1. Note that we still use the same keyword query as in Example 1. The query results
are {paper(6), article(101)}.

Q2: /bib/conf/paper["XML"]["Levy"] union




(b) XML data on conference and journal publications.

Figure 1. Querying XML data.

In the keyword search, a user wants to have high recall and high precision [5]. A naive way to
achieve high recall (100%) in XML keyword search would be to return the root of an XML document. 
However, with this approach, the user would suffer from very low precision due to a large amount of 
spurious results not intended by the user. 

Efforts to eliminate spurious results [11, 15, 28, 46] have enjoyed some success by using the concepts
of LCA or its variants, SLCA [46] and MLCA [28]. For a keyword query $Q = \{w_1, w_2, \ldots, w_m\}$, an LCA is
the common ancestor node of nodes $n_1, n_2, \ldots, n_m$, where $n_i$ is a node directly containing $w_i$ ($1 \le i \le m$),
that is located farthest from the root node. The SLCA method, a refinement of the LCA method, finds
LCAs that do not contain other LCAs. For example, if we use the LCA method to find the results in
Fig. 1(a), {bib(0), conf(1), paper(6), conf(51)} are retrieved. With the SLCA method, {paper(6), conf(51)}
are retrieved. As shown here, existing methods for XML keyword search can still return many spurious
results (e.g., {bib(0), conf(1), conf(51)}), i.e., those that are not intended by the user. Here, following
the common practice [26, 28], we define correct results of a keyword query as those returned by
structured queries (such as Q1) corresponding to the keyword query, which are formulated according
to the schema of the underlying XML data. In a real data set (DBLP), spurious results such as
conf(51) can include huge subtrees having thousands of nodes. This serious problem of low precision in
the state-of-the-art methods not only overburdens the user with filtering numerous spurious results, but
also degrades the performance of the system due to unnecessary computation. For instance, if we issue
a keyword query "XML Levy" over the DBLP data set, we obtain 388,066 nodes using the SLCA method,
among which only 69 nodes (precision = 69/388,066 ≈ 0.02%) are correct results.
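
The difference between the LCA and SLCA methods can be illustrated with a small brute-force sketch in Python. It is not the algorithm of any of the cited methods: the toy tree, its node ids, and the function names (`common_prefix`, `lcas`, `slcas`) are assumptions made for illustration, and each node is represented by its node path, i.e., the ids from the root down to the node.

```python
from itertools import product

def common_prefix(paths):
    """Longest common prefix of several node paths (tuples of node ids)."""
    prefix = []
    for ids in zip(*paths):
        if len(set(ids)) != 1:
            break
        prefix.append(ids[0])
    return tuple(prefix)

def lcas(matches):
    """matches[i] lists the node paths of the nodes directly containing keyword w_i."""
    return {common_prefix(combo) for combo in product(*matches)}

def slcas(matches):
    """Keep only the LCAs that are not a proper ancestor of another LCA."""
    all_lcas = lcas(matches)
    return {a for a in all_lcas
            if not any(len(b) > len(a) and b[:len(a)] == a for b in all_lcas)}

# Hypothetical toy tree: root 0; conference 1 holds paper 2, which contains both
# keywords; conference 10 holds paper 11 ("XML") and paper 14 ("Levy").
xml_nodes = [(0, 1, 2, 3), (0, 10, 11, 12)]
levy_nodes = [(0, 1, 2, 4, 5), (0, 10, 14, 15, 16)]

print(sorted(lcas([xml_nodes, levy_nodes])))   # includes the root (0,)
print(sorted(slcas([xml_nodes, levy_nodes])))  # the root is filtered out
```

On this toy tree the SLCA filter removes the root but keeps node 10, even though its two keyword matches come from different papers; this is the kind of spurious result discussed above.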

The fundamental cause for the occurrence of spurious results is that the existing methods try to
eliminate spurious results locally without global examination of all the query results. For instance, in
Example 1, the LCA method finds a correct result {paper(6)}, but also finds spurious results {bib(0),
conf(1), conf(51)}. With the SLCA method, we can eliminate two spurious results {bib(0), conf(1)} since
they contain other LCAs. However, conf(51) still remains since it is not an ancestor of paper(6). This is
inconsistent since both conf(1) and conf(51) are spurious results having an identical result structure. Here,
we define the result structure² of a query result qr as a (schema-level) twig pattern composed of the
label path [14] from the root of the XML data to the root $qr_{root}$ of qr (simply, the incoming label path)
and the ancestor-descendant edges from $qr_{root}$ to query keywords. In the result structure of a query
result qr, denoted by rs(qr), the node corresponding to $qr_{root}$ is marked as the query result node [35]
and is distinguished from other nodes by placing it in a box. Fig. 2 shows rs(conf(51)) and rs(paper(6)).



2 Intuitively, the result structure is the schema of a query result (an instance). 
(a) rs(conf(51)). (b) rs(paper(6)).

Figure 2. The result structures of query results.

We observe that, if two query results have an ancestor-descendant relationship at the schema level,
the ancestor is spurious. We call this phenomenon structural anomaly. Here, a query result $qr_1$ is an
ancestor of a query result $qr_2$ at the schema level if and only if the incoming label path of $rs(qr_1)$ is
a proper prefix of that of $rs(qr_2)$. By examining the query results at the schema level, we can remove
spurious results having the same result structure consistently. For example, in Fig. 1(a), the query
results of the SLCA method are {paper(6), conf(51)}, and the incoming label path of rs(conf(51)) is a
proper prefix of that of rs(paper(6)) as in Fig. 2. Hence, conf(51), which has the same result structure as
conf(1), is spurious.

We argue that, to improve precision, there should be no structural anomaly in the query results. 
We call this property structural consistency (to be defined more formally in Section 3.1). Otherwise, we
are bound to retrieve inconsistent spurious results. 

In this paper, we resolve structural anomalies by exploiting the notion of the smallest result structure.
The smallest result structure is defined to be a result structure whose incoming label path is not a
proper prefix of those of any other result structures. We then remove the query results whose structure
is not the same as a smallest result structure, thereby obtaining structural consistency. For example,
the smallest result structure of {paper(6), conf(51)} is rs(paper(6)) in Fig. 2(b) since the incoming label
path of rs(paper(6)) is not a proper prefix of that of rs(conf(51)). Thus, conf(51) is removed.

A naive instance-level approach to obtain structural consistency would be to compute all the LCAs
(or variants) and then to remove spurious results according to structural consistency. Obviously, this 
approach would always be slower than existing LCA-based ones. To speed up structural consistency 
checking, we must examine the query results at the schema level without generating all the LCAs. 
The challenging issue here is "How do we formally guarantee that the schema-level approach
produces the same query results as the instance-level approach does?" That is, if we blindly find
SLCAs at the schema level and compute answers using the SLCAs, we may encounter a false dismissal
problem (to be elaborated in more detail in Section 3.2.2). For example, an empty result can be
obtained even though query results corresponding to smallest result structures exist, as in Example 3.
We may also encounter phantom schema-level SLCAs (to be defined in Section 3.2.2), which incur
structural anomaly. These problems occur because the schema-level SLCAs do not homomorphically
map to the instance-level SLCAs. As a solution to these problems, we introduce the concept of iterative
kth-ancestor generalization, which iteratively finds the kth-ancestors of SLCAs at the schema level and
removes phantom schema-level SLCAs. Through iterative kth-ancestor generalization, the schema-level
definition of structural consistency becomes equivalent to the instance-level one, and we formally prove
this equivalence in Theorem 1 of Section 3.2.4.

Example 3 Consider a keyword query Q = {"Levy", "Lu"} issued on the XML data in Fig. 1(a). In the
XML data in Fig. 1(a), we see that there is a query result, paper(61), corresponding to the smallest result
structure shown in Fig. 3(a). However, there is no query result corresponding to the XPath query shown
in Fig. 3(b) that is obtained from the schema-level SLCA. (We will formally define the schema-level
SLCA in Section 3.2.1.) □

(a) The smallest structure. (b) The XPath query obtained from the schema-level SLCA.

Figure 3. An example of false dismissal.

The contributions of this paper are as follows: 1) we formally propose new notions of structural
consistency and structural anomaly; 2) we formally analyze the relationship between the set of
schema-level SLCAs and the set of instance-level SLCAs, and then propose an efficient algorithm that resolves
structural anomaly at the schema level using the relationship analyzed (we call this algorithm
schema-level structural anomaly resolution); 3) we formally prove in Theorem 1 that this algorithm preserves
structural consistency as originally defined at the instance level without incurring false dismissal; 4) we
propose a relevance-feedback based solution for the problem where our method has low recall, which occurs
when it is not the user's intention to find more specific results; 5) we propose an efficient algorithm that
simultaneously evaluates the multiple XPath queries generated by our method; 6) we have prototyped
this algorithm in a full-fledged object-relational DBMS [44]; 7) we perform extensive experiments using
real and synthetic data sets. The results show that we can significantly reduce spurious results compared
with the existing methods by exploiting structural consistency. Furthermore, the experimental results
show that our schema-level algorithm significantly improves the query performance over the existing
ones.

The rest of this paper is organized as follows. Section [2] describes the XML data model, schema 
of XML data, query models, and quality measure of XML keyword search. Section [3] proposes the 
concept of structural consistency and schema-level structural anomaly resolution. Section [4] presents 
the implementation of schema-level structural anomaly resolution. Section [5] reviews existing work, and
Section [6] presents the experimental results. Finally, Section [7] presents our conclusions. 

2 Background 

2.1 XML Data Model 

We model XML data as a labeled tree pTJ [28l [3lJ US] where a node represents an element, attribute, 
or value, and an edge represents the parent-child relationship between two nodes. Every element or 
attribute node has a label and a unique id, and each id is assigned a preorder number. A node that has 
a label l and an id i is denoted as l(i). Definition 1 defines the label path of a node, and Definition 2
the node path.

Definition 1 [14] The label path of a node o is defined as a sequence of node labels $l_1, l_2, \ldots, l_m$ from
the root to the node o, and is denoted as $l_1.l_2.\cdots.l_m$. □

Definition 2 [35] The node path of a node o is defined as a sequence of node identifiers $n_1, n_2, \ldots, n_m$
from the root to the node o, and is denoted as $n_1.n_2.\cdots.n_m$. We denote the ith id of a node path
node_path as node_path[i]. We note that the ids $n_1, n_2, \ldots, n_m$ have an ascending order since each $n_i$
($1 \le i \le m$) is assigned a preorder number. □

2.2 Schema of XML Data 

Although DTD or XML Schema is used as the schema of XML data, XML data often do not have
them [12]. For schemaless XML data, we can derive a schema from XML data using the DataGuide.³
The DataGuide is a labeled tree that has every unique label path of XML data. In a DataGuide, a node
represents the label of an element (or attribute), and an edge represents the parent-child relationship
between two nodes. A node in a DataGuide is uniquely identified by its label path. In this paper,
we augment the DataGuide with keywords contained in value nodes to support keyword queries at the
schema level. We call the augmented DataGuide DataGuide+ and use it as the schema. Every non-value
node in a DataGuide+ is assigned a preorder number.⁴ Hereafter, we call a node of the DataGuide+ a
schema node to distinguish it from a node of XML data, which we call an instance node. For ease of
explanation, we may refer to a schema node by its label path.

Example 4 Fig. 4 shows the DataGuide+ for the XML data in Fig. 1(b). Every unique label path
of the XML data appears exactly once in the DataGuide+. For example, in the XML data, the label
path "bib.conf.paper.author" appears twice, and so does "bib.journal.article.authors.author". In contrast, in the
DataGuide+, each appears only once. □
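
The construction described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used in the paper: the function name `build_dataguide_plus`, the dictionary representation (label path to keyword set), the whitespace tokenization of values, and the sample document are all hypothetical, and attributes are ignored.

```python
import xml.etree.ElementTree as ET

def build_dataguide_plus(xml_text):
    """Map every unique label path to the set of keywords found in its value nodes."""
    guide = {}

    def visit(elem, parent_path):
        path = parent_path + (elem.tag,)
        keywords = guide.setdefault(path, set())  # one entry per unique label path
        if elem.text and elem.text.strip():
            keywords.update(elem.text.split())
        for child in elem:
            visit(child, path)

    visit(ET.fromstring(xml_text), ())
    return guide

# Hypothetical sample document with two papers sharing the same label paths.
SAMPLE = """
<bib><conf>
  <paper><title>XML</title><author><ln>Levy</ln></author></paper>
  <paper><title>Web</title><author><ln>Jagadish</ln></author></paper>
</conf></bib>
"""

guide = build_dataguide_plus(SAMPLE)
for number, path in enumerate(guide):  # dict order follows first visit, i.e. preorder
    print(number, ".".join(path), sorted(guide[path]))
```

Although the sample contains two paper elements, the label path "bib.conf.paper" yields a single schema node, and the keywords of both ln value nodes end up in the same entry.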



3 Recently, Bex et al. [7] have proposed algorithms for the inference of XML Schema Definitions, but we use the
DataGuide since it takes linear time to create and has sufficient power for checking structural consistency. If a DTD or
XML Schema is given along with XML data, we can exploit the given schema.

4 We can use other numbering schemes without loss of generality. For example, to handle schema evolution, we can use
Compact Dynamic Quaternary String (CDQS) encoding [25], which allows for updates without the original nodes having
to be renumbered.

Figure 4. An example DataGuide+.



2.3 Query Models 



2.3.1 Keyword Query 



We model a keyword query as a set of keywords [31]. As in the literature [6] [19] [20] [21] [31] [32] [46], each
query keyword may match (1) labels of elements or attributes or (2) keywords contained in value nodes 
of the XML data. 



2.3.2 XPath Query 

We consider a subset of XPath that uses the child ("/") and descendant ("//") axes and predicates
("[]"). We model a query that belongs to this set as a twig pattern [10]. In the twig pattern, a node, called
a query node [10], represents a label (or a value), and an edge represents the parent-child or
ancestor-descendant relationship between two nodes. One node of the twig pattern is marked as the query result
node [35] and is distinguished from other nodes by placing it in a box. A query node that has more
than one child node is called a branching query node [35]. A leaf node of the twig pattern is called a leaf
query node.

Example 5 Fig. 5 shows an example twig pattern that represents the XPath query Q1. In Fig. 5, paper
is the query result node and, at the same time, the branching query node. Keywords are located in leaf
query nodes "XML" and "Levy".

Q1: /bib/conf/paper["XML"]["Levy"] □

Figure 5. An example twig pattern.

2.4 Quality Metrics of XML Keyword Search 



As quality metrics for keyword queries, we use precision and recall, which have been widely used in the
field of information retrieval (IR). Formula (1) shows the definitions of precision and recall [5]. Here, R
is the set of nodes relevant to the query (i.e., desired results) in the database, and A is the set of nodes
retrieved as the answer to the query (i.e., actual query results). Precision is the fraction of the retrieved
nodes (i.e., A) that are relevant, and recall is the fraction of the relevant nodes (i.e., R) that have been
retrieved. The search quality is good when both precision and recall are close to 1.0 [5].

$$\text{precision} = \frac{|R \cap A|}{|A|}, \qquad \text{recall} = \frac{|R \cap A|}{|R|} \qquad (1)$$
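
As a quick check of Formula (1), the following Python sketch computes both metrics for sets of node names. The function name `precision_recall` and the convention of returning 0.0 for an empty denominator are assumptions of this sketch; the sample sets reuse the SLCA results for Fig. 1(a) quoted in Section 1.

```python
def precision_recall(relevant, retrieved):
    """Return (precision, recall) for two sets of result nodes, following Formula (1)."""
    hits = len(relevant & retrieved)
    precision = hits / len(retrieved) if retrieved else 0.0  # assumed convention
    recall = hits / len(relevant) if relevant else 0.0       # assumed convention
    return precision, recall

R = {"paper(6)"}              # desired result of "XML Levy" on Fig. 1(a)
A = {"paper(6)", "conf(51)"}  # results returned by the SLCA method
print(precision_recall(R, A))
```

Here recall is perfect, but half of the answer set is spurious, so precision drops to 0.5.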



3 Structural Consistency 

In this section, we formally define the notions of structural consistency and structural anomaly in XML 
keyword search. We also propose an efficient algorithm that resolves structural anomaly at the schema 
level. 

3.1 The Concept 

We first define the result structure of a query result in Definition 3. Here, a query result is a subtree
rooted at an SLCA in the XML data. We define structural containment and structural equivalence of
result structures in Definition 4. We then define the structural consistency and the structural anomaly
in Definition 5.

Definition 3 The result structure of a query result qr, denoted as rs(qr), is a (schema-level) twig
pattern composed of the label path from the root of XML data to the root $qr_{root}$ of qr (simply, the
incoming label path) and the ancestor-descendant edges from $qr_{root}$ to query keywords. In the result
structure rs(qr), the node corresponding to $qr_{root}$ is marked as the query result node. □

In Definition 3, we note that the incoming label path information is sufficient to define the structural
consistency, but we attach query keywords to find query results corresponding to the result structure in 
query processing. 

Example 6 Suppose that a keyword query Q = {"XML", "Levy"} is issued on the XML data in Fig. 1(a).
Fig. 6 shows a query result paper(6) and its result structure. Note that a query result is a subtree of
XML data (i.e., an instance), and its result structure is a twig pattern (i.e., a part of schema). □ 



(a) A query result paper(6). (b) rs(paper(6)).

Figure 6. The result structure of a query result paper(6).

Definition 4 Given a keyword query Q and the set of query results $QR = \{qr_1, qr_2, \ldots, qr_m\}$ of Q, the
result structure $rs(qr_i)$ structurally contains the result structure $rs(qr_j)$, as denoted by $rs(qr_i) \prec rs(qr_j)$,
if and only if the incoming label path of $rs(qr_i)$ is a proper prefix of that of $rs(qr_j)$. $rs(qr_i)$ and $rs(qr_j)$
are structurally equivalent, as denoted by $rs(qr_i) = rs(qr_j)$, if and only if their incoming label paths are
identical. We define $rs(qr_i) \preceq rs(qr_j)$ as $rs(qr_i) \prec rs(qr_j)$ or $rs(qr_i) = rs(qr_j)$. □

Definition 5 Given a keyword query Q and the set of query results $QR = \{qr_1, qr_2, \ldots, qr_m\}$ of Q,
structural consistency is a property where the following condition is satisfied for QR:
$(\forall qr_i \in QR)((\neg\exists qr_j \in QR)(rs(qr_i) \prec rs(qr_j)))$. Structural anomaly is a property where structural
consistency is violated, i.e., $(\exists qr_i, \exists qr_j \in QR)(rs(qr_i) \prec rs(qr_j))$. □

Example 7 Suppose that a keyword query Q = {"XML", "Levy"} is issued on the XML data in Fig. 1(a),
and that a set of query results QR = {conf(51), paper(6)} is obtained. Fig. 7 shows their result structures.
We see that $rs(conf(51)) \prec rs(paper(6))$. Thus, QR has structural anomaly. □



(a) rs(conf(51)). (b) rs(paper(6)).

Figure 7. The result structures of query results causing structural anomaly.

We resolve structural anomaly, thereby preserving structural consistency, by removing query results 
whose structure is not the same as a smallest result structure as defined in Definition 6. By enforcing
structural consistency, we can remove spurious results having the same result structure consistently. 

Definition 6 Given a keyword query Q and the set of query results $QR = \{qr_1, qr_2, \ldots, qr_m\}$ of Q, the
set of smallest result structures of QR is $\{rs(qr_i) \mid qr_i \in QR \wedge (\neg\exists qr_j \in QR)(rs(qr_i) \prec rs(qr_j))\}$. □

In Definition 6, "smallest" refers to the resulting subtrees since resulting subtrees are smaller if
their incoming label paths are longer.

Lemma 1 Given a keyword query Q, the set of query results $QR = \{qr_1, qr_2, \ldots, qr_m\}$ of Q, and the
set of smallest result structures $SRS = \{srs_1, srs_2, \ldots, srs_n\}$ of QR, structural consistency holds for QR
if the following condition is satisfied for QR: $(\forall qr_i \in QR)((\exists srs_j \in SRS)(rs(qr_i) = srs_j))$.

Proof: It is straightforward from the definition of the smallest result structure. □ 

Fig. 8 shows a naive algorithm that resolves structural anomaly at the instance level. The algorithm
consists of the following four steps: (1) computing all the SLCAs, (2) finding smallest result structures 
of the SLCAs, (3) removing SLCAs whose result structures are not smallest result structures, and (4) 
returning the set of SLCAs preserving structural consistency. 
Algorithm 1 Naive Structural Anomaly Resolution

Input: (1) a keyword query Q, (2) XML data D
Output: the set QR of query results of Q preserving structural consistency

Algorithm:
Step 1. Compute the set QR of SLCAs of Q on D
Step 2. Find the set SRS of smallest result structures of QR
    2.1 For each $qr_i \in QR$, obtain $rs(qr_i)$ and add it to SRS
    2.2 Remove all $srs_k \in SRS$ from SRS such that $(\exists srs_j \in SRS)(srs_k \prec srs_j)$
Step 3. Remove all $qr_i \in QR$ such that $(\neg\exists srs_j \in SRS)(rs(qr_i) = srs_j)$
Step 4. Return QR

Figure 8. A naive algorithm for resolving structural anomaly.
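
Steps 2 to 4 of the naive algorithm can be sketched in Python as follows. This is only an illustration under assumptions: Step 1 (computing the SLCAs) is taken as given, each query result is represented by a name mapped to its incoming label path, and the names `structurally_contains` and `resolve_structural_anomaly` are hypothetical.

```python
def structurally_contains(path_a, path_b):
    """True if path_a is a proper prefix of path_b (Definition 4)."""
    return len(path_a) < len(path_b) and path_b[:len(path_a)] == path_a

def resolve_structural_anomaly(slca_results):
    """slca_results maps each SLCA (query result) to its incoming label path."""
    # Step 2.1: collect the result structures, identified by incoming label path.
    srs = set(slca_results.values())
    # Step 2.2: drop every structure that structurally contains another one.
    srs = {p for p in srs if not any(structurally_contains(p, q) for q in srs)}
    # Steps 3 and 4: keep only the query results whose structure is smallest.
    return {qr: path for qr, path in slca_results.items() if path in srs}

# Query results of Example 7.
QR = {
    "conf(51)": ("bib", "conf"),
    "paper(6)": ("bib", "conf", "paper"),
}
print(resolve_structural_anomaly(QR))
```

For the query results of Example 7, conf(51) is removed because its incoming label path is a proper prefix of that of paper(6).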
3.2 Schema-level Structural Anomaly Resolution 



Obviously, the naive algorithm would always be slower than existing SLCA-based algorithms. We
propose an efficient algorithm, called schema-level structural anomaly resolution, that resolves structural
anomaly at the schema level. In this algorithm, we first find smallest result structures at the schema
level. We then compute only those query results that correspond to the smallest result structures by
evaluating structured queries constructed from the smallest result structures. We prove in Section 3.2.4
that we can find the smallest result structures using the schema without incurring false dismissal. To do
that, we first define the schema-level SLCA in Section 3.2.1. We then formally analyze the relationship
between the set of schema-level SLCAs and the set of instance-level SLCAs in Section 3.2.2. Through
analysis, we show that simple query evaluation using the schema-level SLCAs cannot obtain the same
query results as the instance-level algorithm does. In Section 3.2.3, we present a solution for this problem,
which we call iterative kth-ancestor generalization. In Section 3.2.4, we present a novel algorithm that
resolves structural anomaly at the schema level using the schema-level SLCAs and iterative kth-ancestor
generalization. We finally prove in Theorem 1 that the schema-level algorithm and the instance-level
algorithm produce an equivalent set of query results that preserve structural consistency.



3.2.1 Schema-level SLCA 



We first define the schema-level LCA in Definition 7 and then define the set of schema-level SLCAs in
Definition 8. In contrast, we call SLCAs in the XML data instance-level SLCAs. Hereafter, $ancestor(s_a, s)$
denotes that node $s_a$ is an ancestor of node $s$, and $ancestor\text{-}or\text{-}self(s_a, s)$ denotes that $ancestor(s_a, s)$ or
$s_a = s$.

Definition 7 Let G be a DataGuide+ and S be the set of all schema nodes in G. For n schema nodes
$s_1, s_2, \ldots, s_n \in S$, $s_a \in S$ is the schema-level LCA of these n schema nodes if and only if the following
conditions are satisfied: (1) $(\forall 1 \le i \le n)(ancestor\text{-}or\text{-}self(s_a, s_i))$, (2) $(\neg\exists s_b \in S)(ancestor(s_a, s_b) \wedge
(\forall 1 \le i \le n)(ancestor\text{-}or\text{-}self(s_b, s_i)))$. The schema-level LCA $s_a$ for $s_1, s_2, \ldots, s_n$ is denoted as
$LCA(s_1, s_2, \ldots, s_n)$. □

We note that, in Definition 7, the LCA is defined for n schema nodes; in Definition 8, the LCA_SET
is defined for m sets of schema nodes. Given a keyword query $Q = \{w_1, w_2, \ldots, w_m\}$ and a DataGuide+
G, $S_i$ ($1 \le i \le m$) denotes the set of schema nodes directly containing $w_i$ in G.

Definition 8 Given a keyword query $Q = \{w_1, w_2, \ldots, w_m\}$ and the set S of all schema nodes in a
DataGuide+ G, the set of schema-level SLCAs is $SLCA\_SET(S_1, S_2, \ldots, S_m) = \{s_a \mid (s_a \in LCA\_SET(S_1, S_2,
\ldots, S_m)) \wedge (\neg\exists s_b \in LCA\_SET(S_1, S_2, \ldots, S_m))(ancestor(s_a, s_b))\}$, where $LCA\_SET(S_1, S_2, \ldots, S_m) =
\{s_a \mid (s_a \in S) \wedge (\exists s_1 \in S_1, \exists s_2 \in S_2, \ldots, \exists s_m \in S_m)(s_a = LCA(s_1, s_2, \ldots, s_m))\}$. □

Example 8 Suppose that a keyword query Q = {"XML", "Levy"} is issued on the XML data in Fig. 1(b).
In the DataGuide+ in Fig. 4, the set of schema-level LCAs is {"bib", "bib.conf", "bib.conf.paper", "bib.journal",
"bib.journal.article"}, and the set of schema-level SLCAs is {"bib.conf.paper", "bib.journal.article"} since these
schema nodes do not contain other schema-level LCAs. □
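
Definitions 7 and 8 can be traced on a small DataGuide+ fragment with the Python sketch below. The fragment `GUIDE`, the placement of keywords in it, and all function names are assumptions for illustration (Fig. 4 is not reproduced here, so the fragment is smaller than the DataGuide+ of Example 8, and keyword matches on labels are omitted). Schema nodes are identified by their label paths, so the schema-level LCA of several nodes is the longest common prefix of their label paths.

```python
from itertools import product

# Hypothetical DataGuide+ fragment: label path -> keywords stored in value nodes.
GUIDE = {
    ("bib", "conf", "paper", "title"): {"XML", "Web"},
    ("bib", "conf", "paper", "author", "ln"): {"Levy", "Jagadish", "Lu"},
    ("bib", "journal", "article", "title"): {"XML"},
    ("bib", "journal", "article", "authors", "author", "ln"): {"Levy", "Lu"},
}

def schema_nodes_for(keyword, guide):
    """S_i: the schema nodes directly containing the keyword."""
    return [path for path, words in guide.items() if keyword in words]

def lca(nodes):
    """Schema-level LCA (Definition 7) as the longest common label-path prefix."""
    prefix = []
    for labels in zip(*nodes):
        if len(set(labels)) != 1:
            break
        prefix.append(labels[0])
    return tuple(prefix)

def is_ancestor(a, b):
    return len(a) < len(b) and b[:len(a)] == a

def slca_set(query, guide):
    """SLCA_SET (Definition 8): LCAs that are not an ancestor of another LCA."""
    node_sets = [schema_nodes_for(w, guide) for w in query]
    lca_set = {lca(combo) for combo in product(*node_sets)}
    return {a for a in lca_set if not any(is_ancestor(a, b) for b in lca_set)}

print(sorted(slca_set(["XML", "Levy"], GUIDE)))
print(sorted(slca_set(["Levy", "Lu"], GUIDE)))
```

For Q = {"XML", "Levy"} the sketch yields the two schema-level SLCAs named in Example 8. For Q = {"Levy", "Lu"} it returns the ln schema nodes, because both keywords are stored under the same label path; Section 3.2.2 discusses why this can differ from the instance-level result.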

3.2.2 The Relationship between the Set of Schema-level SLCAs and the Set of Instance- 
level SLCAs 

To explain the relationship between the set of schema-level SLCAs and the set of instance-level SLCAs,
we first define the schema structure of a schema node in Definition 9. Since both the schema structure
of a schema node and the result structure of a query result are defined as twig patterns, we will use the
same notions of structural equivalence and structural containment for schema structures.

Definition 9 The schema structure of a schema node s, denoted as ss(s), is a twig pattern composed
of the incoming label path from the root of DataGuide+ to s and the ancestor-descendant edges from s
to query keywords. In the schema structure ss(s), the node corresponding to s is marked as the query
result node. □

Given a keyword query, the set SS of schema structures of schema-level SLCAs is largely equivalent 
to the set SRS of smallest result structures of instance-level SLCAs. However, there exist cases where 
SS and SRS are not equivalent since the schema loses some instance-level information by storing only 
unique label paths of the instance nodes. For example, in the XML data in Fig. 1(a), "Levy" and "Lu"
appear in the instance nodes with the label path "bib.conf.paper.author.ln", but they appear in different
instance nodes, ln(65) and ln(68). Nonetheless, in the DataGuide+ in Fig. 4, they appear in the same
schema node with the label path "bib.conf.paper.author.ln" since their label paths are the same. Thus, in
effect, the schema loses the information that "Levy" and "Lu" appear in different instance nodes with the 
same label path. 

There are two cases where SRS and SS are not equivalent: case 1) for some $ss_j \in SS$, there exists
an $srs_i \in SRS$ such that $srs_i \prec ss_j$, and case 2) for some $ss_j \in SS$, there exists no $srs_i \in SRS$ such that
$srs_i \preceq ss_j$. We note that $ss_j \prec srs_i$ does not hold according to the definition of the schema-level SLCA.
In case 1, if we compute query results corresponding to $ss_j$, we will miss query results corresponding to
$srs_i$, i.e., we will incur false dismissal. Example 9 shows an instance of false dismissal. In Section 3.2.3,
we propose a solution to this problem, which we call iterative kth-ancestor generalization. In case
2, if we blindly apply iterative kth-ancestor generalization for $ss_j$, we could end up incurring
structural anomaly. We call $ss_j \in SS$ such that $(\neg\exists srs_i \in SRS)(srs_i \preceq ss_j)$ a phantom schema structure.
Example 10 shows an example of the phantom schema structure. In the next section, we will provide a
solution to eliminate phantom schema structures.

Example 9 Consider a keyword query Q = {"Levy", "Lu"} issued on the XML data in Fig. 1(a).
Figs. 9(a) and (b) show $srs_i \in SRS$ and $ss_j \in SS$, respectively. Here, $srs_i \prec ss_j$. In the XML data
in Fig. 1(a), we see that there is a query result corresponding to $srs_i$, paper(61), but there is no query
result corresponding to $ss_j$. □

(a) $srs_i$. (b) $ss_j$.

Figure 9. An example of false dismissal.

Example 10 Suppose that a keyword query Q = {"XML", "IR"} is used. In the XML data in Fig. 10(a),
$SRS = \{rs(v_1)\}$. In the DataGuide+ in Fig. 10(b), $SS = \{ss(s_1), ss(s_2)\}$. Thus, we do not have an srs
$rs(v_2)$ such that $rs(v_2) \preceq ss(s_2)$, and $ss(s_2)$ is a phantom schema structure. In this case, if we applied
kth-ancestor generalization to $s_2$, we would find conf(1) in Fig. 10(a) as a result, which causes structural
anomaly because $rs(conf(1)) \prec rs(v_1)$. □



bib(O) 




bib(O) 
I 

conf(l) 




" XML IR" fn(6) ln(7) 

I I 

"H" "Jagadish" 

(a) XML data. 



ftitle(2J) paper(3) 
XML IR (Etle(4j) author(5) 

7\ /\ 

XML IR fn(6) ln(7) 

I I 

H Jagadish 

(b) The DataGuide+ for the XML data in (a). 



Figure 10. An example of a phantom schema structure. 

We now formally state the relationship between SRS and SS, which will be used in iterative
kth-ancestor generalization.



Lemma 2 Given a keyword query Q, for all srs_i ∈ SRS, there exists ss_j ∈ SS such that srs_i ⪯ ss_j.

Proof: See Appendix A. □



We can obtain srs_i ∈ SRS by computing the set QR_j of the query results corresponding to ss_j ∈ SS.
If QR_j is non-empty, then we have obtained srs_i ∈ SRS such that srs_i = ss_j. If QR_j is empty, we can
obtain srs_i ∈ SRS such that srs_i ≺ ss_j by applying iterative kth-ancestor generalization.

3.2.3 Iterative kth-Ancestor Generalization



In this section, we present iterative kth-ancestor generalization to solve the problems of false dismissal
and phantom schema structures. Here, we iteratively find a kth-ancestor s_a of the schema-level SLCA
s such that ss(s_a) = srs ∈ SRS where srs ≺ ss(s). We define the kth-ancestor in Definition 10.

Definition 10 Given two nodes, s_a and s, s_a is the kth-ancestor of s if s_a is an ancestor of s and
depth(s) = depth(s_a) + k, where depth(s) is the length of the path from the root to s. □

Example 11 We can obtain srs_i ∈ SRS in Fig. 9(a) by finding the 2nd-ancestor of the schema-level
SLCA in Fig. 9(b). □
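
Definition 10 can be illustrated with a small sketch. It assumes, purely for illustration, that a schema node is identified by the sequence of node ids on the path from the root (the numeric label path used by the schema index in Section 4.1); the names `Path`, `depth`, and `kth_ancestor` are hypothetical.

```python
from typing import Tuple

Path = Tuple[int, ...]  # ids on the path from the root to a schema node


def depth(s: Path) -> int:
    """Length of the path from the root to s (the root has depth 0)."""
    return len(s) - 1


def kth_ancestor(s: Path, k: int) -> Path:
    """Return s_a with depth(s) = depth(s_a) + k, for 1 <= k <= depth(s)."""
    if not 1 <= k <= depth(s):
        raise ValueError("k must satisfy 1 <= k <= depth(s)")
    return s[:-k]


# The ln schema node of Example 16 has the numeric label path 0.1.6.8.10
# (bib.conf.paper.author.ln); its 2nd-ancestor is the paper node 0.1.6.
ln_node = (0, 1, 6, 8, 10)
paper_node = kth_ancestor(ln_node, 2)
```

Under this encoding, taking the kth-ancestor is just dropping the last k ids, which is why storing the full numeric label path in each posting makes generalization cheap.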

Lemma 3 Given a keyword query Q, suppose that srs_i ∈ SRS structurally contains ss(s) ∈ SS, i.e.,
srs_i ≺ ss(s). Then, there must exist a kth-ancestor s_a (1 ≤ k ≤ depth(s)) of s such that ss(s_a) = srs_i ∈ SRS.

Proof: See Appendix B. □

In iterative kth-ancestor generalization, we iteratively find the kth-ancestor s_a of the schema-level
SLCA s, starting from the parent of s (i.e., k = 1), until the set of the query results corresponding to ss(s_a) is
non-empty. Here, obtaining non-empty results indicates that srs ∈ SRS has been found. Thus, we solve
the false dismissal problem.

To eliminate phantom schema structures during iterative kth-ancestor generalization, we need to
iteratively check structural consistency. Initially, there is no structural anomaly for the set of
schema-level SLCAs. As schema-level SLCAs are generalized, structural anomaly can be incurred by their
ancestors in the schema. Then, computing query results corresponding to the kth-ancestor incurring
structural anomaly in the schema will incur structural anomaly in the instances. For example, in
Fig. 10(b), the schema structure of the 1st-ancestor of s_2, ss(conf(1)), structurally contains the schema
structure ss(s_1) of the schema-level SLCA s_1. In this case, if we compute query results corresponding to
ss(conf(1)), we obtain conf(1) in Fig. 10(a). Here, rs(conf(1)) ≺ rs(v_1), causing structural anomaly. Thus,
we iteratively remove ancestors incurring structural anomaly and stop applying generalization for them.
That is, we remove phantom schema structures.

We note that one srs_i ∈ SRS can structurally contain multiple schema structures ss(s_1), ss(s_2), ...,
ss(s_n) ∈ SS. In such cases, if we blindly generalize all the schema-level SLCAs s_1, s_2, ..., s_n, we obtain
duplicate query results corresponding to srs_i. Thus, we must generalize only one schema-level SLCA
for srs_i. This constraint is also enforced by iteratively checking structural consistency. Suppose that
s_1, s_2, ..., s_n are being generalized to srs_i in this order. It is clear that s_i (1 ≤ i ≤ n−1) will be removed
since s_i, when sufficiently generalized, must become the ancestor of s_n. Therefore, we can guarantee
that only one schema-level SLCA, s_n, is generalized.

3.2.4 Putting It All Together

Fig. 11 shows an enhanced algorithm that resolves structural anomaly at the schema level using the
schema-level SLCAs and iterative kth-ancestor generalization. This algorithm produces the same query
results as the instance-level algorithm in Fig. 8 does. We will present the detailed query processing
method of this algorithm in Section 4. Step 1 finds the set of schema-level SLCAs S_unmarked = {s_1, s_2,
..., s_m}, and Step 2 computes the set of the query results corresponding to ss(s_i) (1 ≤ i ≤ m) by evaluating
the XPath query that represents ss(s_i). Here, we convert ss(s_i) to an XPath query to make our method
run on top of any query evaluation engine that supports XPath. Step 3 applies iterative kth-ancestor
generalization for s_i ∈ S_unmarked. In Step 3.2.1.1, we check whether an srs ∈ SRS such that srs = ss(s_i)
has been found by examining whether QR_i is non-empty. If it has, in Step 3.2.1.1.1, we move such s_i
to S_marked. If not, in Step 3.2.1.2.1, we obtain the parent of s_i using the parent(s_i) function. In Step
3.2.1.2.2.1, we remove s_i, which incurs structural anomaly, from S_unmarked.

Example 12 Suppose that a keyword query Q = {"XML", "IR"} is used to query the XML data in
Fig. 10(a). In Step 1, S_unmarked = {s_1, s_2}. In Step 2, the set QR_1 of the query results corresponding
to ss(s_1) is non-empty ({title(4)}), but QR_2 for ss(s_2) is empty. In Step 3.2.1.1, since QR_1 ≠ {}, we
move s_1 from S_unmarked to S_marked and add QR_1 to the set QR of query results. Hence, S_unmarked =
{s_2}, S_marked = {s_1}, and QR = {title(4)}. In Step 3.2.1.2, since QR_2 = {}, we generalize s_2. Now s_2
incurs structural anomaly since (∃s_1 ∈ S_marked)(ss(s_2) ≺ ss(s_1)). Thus, we remove s_2 from S_unmarked.
Now S_unmarked = {}, and we end the iteration.

Algorithm 2 Schema-level Structural Anomaly Resolution

Input: (1) a keyword query Q, (2) XML data D,
       (3) the DataGuide+ G for D
Output: the set QR of query results of Q preserving structural consistency

Algorithm:
Step 1. Find the set of schema-level SLCAs S_unmarked = {s_1, s_2, ..., s_m} of Q on G
Step 2. Compute the set QR_i of the query results in D corresponding to ss(s_i) (1 ≤ i ≤ m)
        using the query processing method in Section 4
Step 3. Apply iterative kth-ancestor generalization
  3.1 QR := {}; S_marked := {} /* initialize */
  3.2 Repeat until S_unmarked = {}
    3.2.1 For each s_i ∈ S_unmarked
      /* check if an srs ∈ SRS such that srs = ss(s_i) has been found */
      3.2.1.1 If QR_i ≠ {} Then
        /* an srs = ss(s_i) has been found */
        3.2.1.1.1 Move s_i from S_unmarked to S_marked
        /* add the query results corresponding to srs to QR */
        3.2.1.1.2 QR := QR ∪ QR_i
      3.2.1.2 Else /* QR_i = {} */
        /* generalize s_i */
        3.2.1.2.1 s_i := parent(s_i)
        /* check structural consistency */
        3.2.1.2.2 If (∃s_k ∈ S_marked)(ss(s_i) ≺ ss(s_k)) ∨
                     (∃s_l ∈ S_unmarked)(ss(s_i) ≺ ss(s_l)) Then
          /* incurs structural anomaly */
          3.2.1.2.2.1 Remove s_i from S_unmarked
        3.2.1.2.3 Else
          3.2.1.2.3.1 Compute the set QR_i of the query results in D corresponding to ss(s_i)
                      using the query processing method in Section 4

Figure 11. The algorithm for resolving structural anomaly at the schema level.

In Step 3, even if we process s_2 first, we can obtain the correct result without a problem. In Step
3.2.1.2.2, s_2 incurs structural anomaly since (∃s_1 ∈ S_unmarked)(ss(s_2) ≺ ss(s_1)). Thus, we remove s_2
from S_unmarked, obtaining S_unmarked = {s_1} and S_marked = {}. Now we move s_1 from S_unmarked to
S_marked, add QR_1 to QR, and end the iteration. □
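
The control flow of Algorithm 2 can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the authors' implementation: schema nodes are encoded as tuples of ids from the root, the test ss(a) ≺ ss(b) is approximated by "a is a proper ancestor of b in the schema", the schema-level SLCAs are passed in rather than computed, and `evaluate` is a hypothetical stand-in for the query processing method of Section 4. Dropping an ancestor that exactly equals a remaining node is an addition of this sketch to avoid duplicate results.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Path = Tuple[int, ...]  # a schema node, given as the ids from the root


def contains(a: Path, b: Path) -> bool:
    """Assumed test for ss(a) ≺ ss(b): a is a proper ancestor of b."""
    return len(a) < len(b) and b[:len(a)] == a


def resolve_anomalies(slcas: Sequence[Path],
                      evaluate: Callable[[Path], List[str]]) -> List[str]:
    unmarked = list(slcas)                                    # Step 1 (given)
    results: Dict[Path, List[str]] = {s: evaluate(s) for s in unmarked}  # Step 2
    marked: List[Path] = []
    qr: List[str] = []                                        # Step 3.1
    while unmarked:                                           # Step 3.2
        kept: List[Path] = []
        for i, s in enumerate(unmarked):                      # Step 3.2.1
            if results[s]:                                    # Step 3.2.1.1
                marked.append(s)
                qr.extend(results[s])
                continue
            if len(s) == 1:          # the root cannot be generalized further
                continue
            parent = s[:-1]                                   # Step 3.2.1.2.1
            others = marked + kept + unmarked[i + 1:]
            if any(parent == t or contains(parent, t) for t in others):
                continue             # Step 3.2.1.2.2.1: structural anomaly
            results[parent] = evaluate(parent)                # Step 3.2.1.2.3.1
            kept.append(parent)
        unmarked = kept
    return qr


# Example 12, with schema node ids assumed to be bib(0), conf(1), title(2),
# paper(3), title(4): s_1 = bib.conf.paper.title and s_2 = bib.conf.title.
S1, S2 = (0, 1, 3, 4), (0, 1, 2)
INSTANCES = {S1: ["title(4)"], (0, 1): ["conf(1)"]}


def lookup(s: Path) -> List[str]:
    return INSTANCES.get(s, [])


answer = resolve_anomalies([S1, S2], lookup)
```

As in Example 12, the result is the same whichever of s_1 and s_2 is processed first, because the generalized s_2 (conf) is an ancestor of s_1 in both orders.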

Theorem 1 The Schema-level Structural Anomaly Resolution algorithm produces the same query
results as the instance-level algorithm in Fig. 8 does.

Proof: By Lemma 2, for every srs_i ∈ SRS, there exists ss(s_j) ∈ SS such that (1) srs_i = ss(s_j) or (2)
srs_i ≺ ss(s_j). For case 1, we can obtain srs_i ∈ SRS by computing the query results corresponding to
ss(s_j) (Step 2). For case 2, we can obtain srs_i ∈ SRS by applying iterative kth-ancestor generalization
according to Lemma 3 (Step 3). In this case, even if generalization is stopped for s_j because of incurring
structural anomaly, we are still able to obtain srs_i ∈ SRS since there always exists a schema-level
SLCA s_n such that ss(s_j) ≺ ss(s_n) (which is exactly what caused the structural anomaly), and we
can find srs_i by generalizing s_n. Finally, an ss(s_j) ∈ SS such that (¬∃srs_i ∈ SRS)(srs_i ⪯ ss(s_j)), i.e., a
phantom schema structure, is always removed since the kth-ancestor s_a of s_j must eventually incur
structural anomaly when s_j is generalized to the root node. Otherwise, we contradict the assumption
(¬∃srs_i ∈ SRS)(srs_i ⪯ ss(s_j)) since it must be that srs_i = ss(s_a) at the root node. □

We now analyze the complexity of our schema-level algorithm. Given a keyword query Q = {w_1, w_2,
..., w_n}, the worst case time complexity of the schema-level algorithm is O(|S_1| d Σ_{i=2}^{n} log|S_i| + d C_XPath),
where S_i (1 ≤ i ≤ n) is the set of schema nodes directly containing the query keyword w_i in the DataGuide+,
d the maximum depth of the XML data, and C_XPath the cost of XPath query evaluation, which will
be presented in Section 4.2.2. Here, O(|S_1| d Σ_{i=2}^{n} log|S_i|) is the cost of computing schema-level
SLCAs using the algorithm of Xu and Papakonstantinou [46], and O(d C_XPath) is the cost of iterative
kth-ancestor generalization since, in the worst case, generalization can be applied until one of the
schema-level SLCAs reaches the root node.

Compared with the existing instance-level SLCA algorithm [46], the schema-level algorithm is
generally more efficient since it avoids unnecessary computation of spurious results by removing them
early at the schema level. The additional overheads of the schema-level algorithm are the computation
of schema-level SLCAs and iterative kth-ancestor generalization. However, those overheads are small
in practice. First, the cost of the schema-level SLCA computation tends to be very small since the
schema is generally several orders of magnitude smaller than the XML data [4]. Second, the cost of
iterative kth-ancestor generalization is negligible since the generalization occurs only occasionally and
is usually applied only once or twice. (According to our experiments in Section 6, the cost of iterative
kth-ancestor generalization is less than 10% of the total query processing cost.) In the worst case,
however, our schema-level algorithm could be about twice as slow as the instance-level SLCA
algorithm. The reasons are as follows. First, when the schema is as large as the XML data, the overhead
of schema-level SLCA computation would be almost the same as the cost of the instance-level SLCA
computation. Second, after obtaining the schema-level SLCAs, we compute query results that
correspond to the schema-level SLCAs by evaluating the XPath queries. This query evaluation could also
be as expensive as the instance-level SLCA computation if there exist few spurious results, since then
our method loses its benefit over existing SLCA-based methods, namely avoiding unnecessary computation
of spurious results through early removal. (See the experimental results of QD_1 and QD$ in Fig. 23(c)
and QX_1 and QX S in Fig. 27(c) of Section 6.)

3.3 A Relevance-Feedback Based Solution for the Low Recall Problem 

When users intend to find more general results (although this is relatively rare), which we regard as 
spurious results, our method can have lower recall than existing methods. For example, suppose that a 
user intends to find a conference on "XML" where "Levy" is the chair. If there is at least one paper about 
"XML" authored by "Levy", our method does not retrieve the desired conference. We call this problem 
the low recall problem. 

The fundamental cause for this problem is the inherent ambiguity in keyword search, i.e., the actual 
intention of the user is unknown. We can solve this problem by exploiting the user's relevance feedback. 
Relevance feedback is an important way of enhancing search quality by using relevance information 
provided by the user p~6l[37]. The solution is as follows. The initial query results are presented to the 
user, and the user gives feedback if desired results are not retrieved. (This kind of relevance feedback can 
be easily implemented using a user-friendly GUI, and users just need to click a button.) This feedback is 
sent to the system, and the system generalizes the smallest result structure and finds results again. (We 
can repeat this feedback process until all the desired results are retrieved.) For example, our method 
does not retrieve the desired conference if there is at least one paper about "XML" authored by "Levy". 
Since the desired result has not been retrieved, the user sends feedback to the system, and the system 
now finds conferences containing "XML" and "Levy" by generalizing the smallest result structure. Then, 
the user can obtain the desired result. When there are multiple smallest result structures, we can allow 
the user to choose which smallest result structure he wants to generalize. To do this, we need to group 
the query results for each smallest result structure and show each group to the user.

We implement this relevance-feedback based solution by modifying Algorithm 2. In Step 3.2.1.1 of
Algorithm 2, we check whether the set QR_i of the query results corresponding to a schema-level SLCA
s_i is non-empty. If QR_i is empty, we generalize s_i in Step 3.2.1.2.1 by finding the parent of s_i. We
implement relevance feedback by modifying Step 3.2.1.1 such that s_i should be generalized even if QR_i
is non-empty when the user's relevance feedback is received.
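
The modified decision in Step 3.2.1.1 can be sketched as follows. This is an illustrative sketch only: the function name `step_3_2_1_1`, the tuple encoding of schema nodes (ids from the root), and the string action tags are assumptions, not part of the paper.

```python
from typing import List, Tuple

Path = Tuple[int, ...]  # a schema node, given as the ids from the root


def step_3_2_1_1(s: Path, qr_i: List[str],
                 feedback_received: bool) -> Tuple[str, Path]:
    """Decide whether to keep s_i or generalize it to its parent.

    Without feedback this is the original test (generalize only when QR_i
    is empty); with feedback, s_i is generalized even if QR_i is non-empty.
    """
    if qr_i and not feedback_received:
        return ("mark", s)           # results found and accepted by the user
    if len(s) == 1:
        return ("mark", s)           # the root cannot be generalized further
    return ("generalize", s[:-1])    # move to the parent schema node


# Hypothetical ids: (0, 1, 6) stands for bib.conf.paper, (0, 1) for bib.conf.
accepted = step_3_2_1_1((0, 1, 6), ["paper(6)"], feedback_received=False)
rejected = step_3_2_1_1((0, 1, 6), ["paper(6)"], feedback_received=True)
```

In the "XML Levy" scenario above, the second call corresponds to the user clicking the feedback button: the paper-level structure is generalized to the conference level and the query is evaluated again.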

The reason why relevance feedback is possible is that we process queries at the schema level. The 
schema-level processing makes the relevance-feedback mechanism feasible since users just need to give 
feedback on a small number of schema-level SLCAs. However, it is hard to apply to instance-level 
methods since the number of instance- level SLCAs is generally much larger than that of schema-level 
SLCAs. Furthermore, it is not clear how we can receive the relevance feedback and generalize the results 
in the instance- level SLCA algorithm [46] . 

We can handle XML data having a recursive schema using the same technique. Fig. 12 shows
recursive XML data where the parent-child relationship between two employees represents the
supervisor-supervisee relationship. Suppose that the query is "John employee" and the user intends to find all
employees whose name is "John". In this case, our method (and also SLCA and MLCA) finds only
employee(3), resulting in low recall. We can also resolve this problem by generalizing the smallest result
structure via relevance feedback.

employee(1)
    name(2): "John"
    employee(3)
        name(4): "John"

Figure 12. XML data having a recursive schema.

The low recall problem may also be handled by ranking in a spirit similar to the work of Amer- 
Yahia et al. [3] . Enabling users to exploit partial knowledge of the schema in user queries [IT] [28] [48] 
can also help us to disambiguate the user's intention. We leave these issues for future work.

3.4 Search Quality Comparisons with Earlier Methods

In this section, we summarize search quality comparisons with earlier methods, SLCA [46] , MLCA [28] 
(a variant of SLCA), XSEarch p], CVLCA [26], and XReal [6]. XSEarch and CVLCA are based on a 
heuristic called interconnection relationship. According to the heuristic, two nodes are considered to 
be semantically related if and only if there are no two distinct nodes with the same label on the path 
between these two nodes (excluding the two nodes themselves). Li et al. [28] have pointed out that
the heuristic could retrieve spurious results and have shown that MLCA is generally superior to the 
heuristic. XReal infers the user's intention using the statistics of the underlying XML data. 
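
The interconnection heuristic, as paraphrased above, can be written down directly. The sketch below is not code from XSEarch or CVLCA; it assumes that each node is given by its root-to-node path of (label, id) pairs within a single tree, and the names `path_between` and `interconnected` are hypothetical.

```python
from typing import List, Tuple

Node = Tuple[str, int]  # (label, node id)


def path_between(a: List[Node], b: List[Node]) -> List[Node]:
    """Nodes strictly between the end nodes of two root-to-node paths."""
    i = 0
    while i < min(len(a), len(b)) and a[i] == b[i]:
        i += 1
    # a[i - 1] is the LCA; walk up from a's node to it, then down to b's node.
    full = a[i:][::-1] + [a[i - 1]] + b[i:]
    return full[1:-1]  # exclude the two nodes themselves


def interconnected(a: List[Node], b: List[Node]) -> bool:
    """True iff no two distinct nodes between a and b share a label."""
    labels = [label for label, _ in path_between(a, b)]
    return len(labels) == len(set(labels))


# Hypothetical tree: conf(1) has paper(2) and paper(5) as children.
title = [("bib", 0), ("conf", 1), ("paper", 2), ("title", 3)]
author_same = [("bib", 0), ("conf", 1), ("paper", 2), ("author", 4)]
author_other = [("bib", 0), ("conf", 1), ("paper", 5), ("author", 6)]

same_paper = interconnected(title, author_same)     # path: paper(2)
other_paper = interconnected(title, author_other)   # path: paper, conf, paper
```

A title and an author of the same paper are related under this heuristic, while a title and an author from different papers are not, because two distinct paper nodes lie on the connecting path.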

Since keyword queries are inherently ambiguous, the desired results of a keyword query depend on 
the user's intention. The user may want to find 1) more specific results or 2) more general (as opposed 
to specific) results. For example, for a keyword query "XML Levy", the user may want to find either 1) 
papers about "XML" authored by "Levy" or 2) conferences on "XML" where "Levy" is the chair. 

When the user's intention is to find more specific results, the precision values of our method are 
higher than or equal to those of existing methods since our method is able to eliminate more spurious 
results (i.e., general results) than existing methods by enforcing structural consistency. In addition, the 
recall values of our method and those of existing methods are the same since our method finds all the 
specific results, i.e., the query results that correspond to smallest result structures, as existing methods 
do. 

Example 13 Suppose that a keyword query Q = {"XML", "Levy", "Lu"} is issued on the XML data in
Fig. 13. The user wants to find papers about "XML" authored by "Levy" and "Lu", and the desired result
is paper(2). SLCA, XSEarch, and CVLCA find not only paper(2) but also spurious (i.e., general) results
conf(10) and conf(17). MLCA can eliminate conf(10) since in the subtree rooted at conf(10), title(12) and
title(15) are the nodes that contain "XML", speaker(13) is the node that contains "Levy", and the LCA
of title(15) and speaker(13), i.e., conf(10), contains the LCA of title(12) and speaker(13), i.e., keynote(11).
XReal retrieves {conf(10), conf(17)} with the ranking since it infers conf as the desired node type⁵ based
on the XML document frequency [6]. Our method can eliminate all the spurious results by enforcing
structural consistency. Thus, compared with SLCA, MLCA, XSEarch, CVLCA, and XReal, our method
improves precision without sacrificing recall. □

5 Since the highest confidence value (2.66) is significantly higher than the second highest value (1.41), XReal chooses
the one with the highest confidence, conf, as the desired node type and retrieves only conf nodes.

Figure 13. The case where structural consistency shows high precision.

Figure 14. The case where structural consistency shows low recall.

When the user's intention is to find more general results, our method can have lower recall than
existing methods, and we can solve this problem using relevance feedback. The recall values of our
method with relevance feedback are higher than or equal to those of existing methods since we can
eventually obtain the desired results via generalization. In the worst case, however, the precision values
of our method with relevance feedback could be lower than those of existing methods since it may find
more spurious results during generalization, as we see in Example 14. We note that the worst case is
quite rare in practice.⁶

Example 14 Suppose that a keyword query Q = {"XML", "Levy"} is issued on the XML data in Fig. 14
to find conferences on "XML" where "Levy" is the chair. The desired result is conf_year(20). SLCA and
MLCA find {paper(6), conf_year(20), conf(50)}. XSEarch and CVLCA find {paper(6), conf_year(20)}. XReal
finds {conf_year(3), conf_year(20)}. Here, paper(6), conf_year(3), and conf(50) are spurious results. Our method
initially finds only {paper(6)}, and thus, the recall of our method is 0. By using relevance feedback,
our method obtains {conf_year(3), conf_year(20)} through generalization, and thus, the recall becomes 1.0.
During generalization, our method finds a spurious result conf_year(3), but the precision value of our
method is higher than those of SLCA and MLCA since the subtree rooted at conf(50) is much bigger
than that of conf_year(3). However, if we remove the subtree rooted at conf(50) from the XML data (this is
the worst case of our method), the precision value of our method can be lower than those of SLCA and
MLCA. (See Figs.HKa) andH^a) in Section 6.) Compared with XSEarch and CVLCA, the precision
value of our method is lower since our method finds conf_year(3). Compared with XReal, the precision
value of our method is lower since our method finds paper(6). □

6 To find one, we had to test more than one hundred queries that are structurally similar to that shown in Example 14
against the NASA and XMark data sets in Section 6. We were not able to find a similar query in the DBLP data set since
its structure is simpler than those of the NASA and XMark data sets.

4 Implementation 

In this section, we describe the implementation details of the schema-level structural anomaly resolution.
Section 4.1 presents the index structures used in the query processing. Section 4.2 presents the query
processing method.

4.1 Index Structures 

To speed up query processing, we use indexes for the DataGuide+ and XML data. We use an inverted
index for a DataGuide+, which we call the schema index, to efficiently compute the schema-level SLCAs.
We use an inverted index for XML data, which we call the instance index, to efficiently evaluate XPath
queries. Inverted indexes have been used in many XML query processing methods [TQl [TS, 28, 35]. We
also use a table called LabelPath [35] to store all the label paths occurring in the DataGuide+.

Table 1 summarizes the notation to be used for explaining the index structures. In Table 1, if a
schema (or an instance) node s is a value node, we use parent(s) instead of s as a parameter for all
functions since value nodes themselves do not have ids.

Table 1. Summary of notation.

Symbols                  Definitions
snode_id(s)              the id of a schema node s
label_path(s)            the label path of a schema (or an instance) node s
label_path_id(s)         the id of label_path(s) (= snode_id(s))
numeric_label_path(s)    label_path(s) represented as a sequence of snode_ids rather than labels
                         (numeric_label_path(s)[i] denotes the ith id)
inode_id(o)              the id of an instance node o
node_path(o)             the node path of an instance node o



A LabelPath table consists of tuples of the form (label_path_id, label_path), where label_path is the
label path of a schema node s, and label_path_id is the same as the id of s. A B+-tree index is created
on the label_path_id column, and an inverted index on the label_path column.



Example 15 Fig. 15 shows the LabelPath table for the DataGuide+ in Fig. 4. In the DataGuide+, the
label path of the schema node having the id of 6 is "bib.conf.paper". □



label_path_id    label_path
6                bib.conf.paper
7                bib.conf.paper.title
10               bib.conf.paper.author.ln
12               bib.journal.article
13               bib.journal.article.title
17               bib.journal.article.authors.author.ln

Figure 15. An example LabelPath table.
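
As a rough in-memory analogue of the LabelPath table, the rows of Fig. 15 can be held in a dictionary. In this sketch a plain `dict` stands in for the B+-tree on label_path_id, and a label-to-ids map stands in for the inverted index on label_path; treating each label as an index term is an assumption made for illustration, and the names `LABEL_PATH`, `label_path_of`, and `build_inverted_index` are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List

# Rows shown in Fig. 15: label_path_id -> label_path.
LABEL_PATH: Dict[int, str] = {
    6: "bib.conf.paper",
    7: "bib.conf.paper.title",
    10: "bib.conf.paper.author.ln",
    12: "bib.journal.article",
    13: "bib.journal.article.title",
    17: "bib.journal.article.authors.author.ln",
}


def label_path_of(label_path_id: int) -> str:
    """Point lookup by id (the role played by the B+-tree index)."""
    return LABEL_PATH[label_path_id]


def build_inverted_index(table: Dict[int, str]) -> Dict[str, List[int]]:
    """Map each label to the sorted ids of the label paths containing it."""
    inverted: Dict[str, List[int]] = defaultdict(list)
    for path_id in sorted(table):
        for label in set(table[path_id].split(".")):
            inverted[label].append(path_id)
    return dict(inverted)


INVERTED = build_inverted_index(LABEL_PATH)
```

With these rows, `label_path_of(6)` gives "bib.conf.paper", matching Example 15.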



The schema index stores a list of postings for each unique value (or label) that appears in the
DataGuide+. The posting of a schema node s has the form (snode_id(s), numeric_label_path(s)).
numeric_label_path(s) is used to find the ancestor nodes of s. Postings in a posting list are stored in
ascending order of snode_id(s).



Example 16 Fig. 16 shows the schema index for the DataGuide+ in Fig. 4. Let s be the schema node
with the value "Jagadish" in Fig. 4. Then, snode_id(s) = 10 and numeric_label_path(s) = 0.1.6.8.10.
Thus, a posting (10, 0.1.6.8.10) is stored in the posting list of "Jagadish". □

posting list
<5, 0.1.3.5>, <10, 0.1.6.8.10>, <17, 0.11.12.14.15.17>
<10, 0.1.6.8.10>
<7, 0.1.6.7>, <13, 0.11.12.13>, <18, 0.11.18>

Figure 16. An example schema index.
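
A schema-index posting and the ancestor lookup it supports can be sketched as follows. Only the "Jagadish" posting of Example 16 is used; the `SchemaPosting` type and the helper names are hypothetical.

```python
from typing import Dict, List, NamedTuple, Tuple


class SchemaPosting(NamedTuple):
    snode_id: int
    numeric_label_path: Tuple[int, ...]


def make_posting(dotted_path: str) -> SchemaPosting:
    """Build a posting from a dotted numeric label path such as '0.1.6.8.10'."""
    ids = tuple(int(part) for part in dotted_path.split("."))
    return SchemaPosting(snode_id=ids[-1], numeric_label_path=ids)


def ancestor_ids(posting: SchemaPosting) -> List[int]:
    """Ids of the ancestors of the schema node, from the root downwards."""
    return list(posting.numeric_label_path[:-1])


# Posting list of "Jagadish" from Example 16 (already in ascending id order).
SCHEMA_INDEX: Dict[str, List[SchemaPosting]] = {
    "Jagadish": [make_posting("0.1.6.8.10")],
}

jagadish = SCHEMA_INDEX["Jagadish"][0]
```

Because the last id of the numeric label path is the node's own id, the ancestors are available without touching the DataGuide+ itself.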

The instance index stores a list of postings for each unique keyword (or label) that appears in XML
data. The posting of an instance node o has the form (inode_id(o), node_path(o), numeric_label_path(o)).
node_path(o) is used to find the ancestor nodes of o, and numeric_label_path(o) is used to find the label
path of o. Postings in a posting list are stored in ascending order of inode_id(o). We create a B+-tree
index, which is called a subindex [43] [44], on each posting list of the instance index in the same way as
was done by Guo et al. [TS] and Whang et al. [531 S3]. The key of a subindex is inode_id(o).

Example 17 Fig. 17 shows the instance index for the XML data in Fig. [ljb). Let o be the instance
node with the value "Jagadish" in Fig. W[b). Then, inode_id(o) = 15, node_path(o) = 0.1.11.13.15, and
label_path(o) = "bib.conf.paper.author.ln". Since numeric_label_path(o) = 0.1.6.8.10 for label_path(o) in the
DataGuide+ in Fig. 4, a posting (15, 0.1.11.13.15, 0.1.6.8.10) is stored in the posting list of "Jagadish". □

<inode_id, node_path, numeric_label_path>

Figure 17. An example instance index.
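
The instance-index posting of Example 17 can be modelled in the same spirit. In this sketch, a sorted Python list searched with `bisect` stands in for the B+-tree subindex keyed by inode_id, and a one-row dictionary stands in for the LabelPath table; the type and function names are hypothetical.

```python
from bisect import bisect_left
from typing import Dict, List, NamedTuple, Optional, Tuple


class InstancePosting(NamedTuple):
    inode_id: int
    node_path: Tuple[int, ...]           # instance node ids from the root
    numeric_label_path: Tuple[int, ...]  # schema node ids from the root


LABEL_PATH: Dict[int, str] = {10: "bib.conf.paper.author.ln"}

# Posting list of "Jagadish" from Example 17, sorted by inode_id.
JAGADISH: List[InstancePosting] = [
    InstancePosting(15, (0, 1, 11, 13, 15), (0, 1, 6, 8, 10)),
]


def label_path(p: InstancePosting) -> str:
    """label_path_id equals the last snode_id of the numeric label path."""
    return LABEL_PATH[p.numeric_label_path[-1]]


def find_posting(postings: List[InstancePosting],
                 inode_id: int) -> Optional[InstancePosting]:
    """Binary search by inode_id (the role of the subindex)."""
    ids = [p.inode_id for p in postings]
    i = bisect_left(ids, inode_id)
    return postings[i] if i < len(ids) and ids[i] == inode_id else None


found = find_posting(JAGADISH, 15)
```

The two paths in a posting answer different questions: node_path locates the ancestors in the XML data, while numeric_label_path identifies the schema node and hence the label path.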

4.2 Query Processing Method

The query processing method consists of the following two steps. The first step, presented in Section 4.2.1,
translates a given keyword query Q into multiple XPath queries corresponding to the schema-level
SLCAs. The second step, presented in Section 4.2.2, evaluates the XPath queries obtained in the first
step.

4.2.1 Query Translation

We first compute schema-level SLCAs (or their ancestors) and then generate XPath queries specifying
their schema structures. Fig. 18 shows the algorithm Query Translation, which consists of the following
two steps.

In Step 1, we compute the set S of schema-level SLCAs using the GetSLCA function that
implements the SLCA searching algorithm of Xu and Papakonstantinou [46]. They use this function to
compute instance-level SLCAs, but we use it here to compute schema-level ones. For each schema-level
SLCA sslca_i, we add the snode_id of sslca_i to S. In iterative kth-ancestor generalization, the algorithm
is modified to find ancestors of the schema-level SLCAs.

In Step 2, we generate an XPath query xpq_i for each schema-level SLCA with the snode_id s_i ∈ S.
In the XPath query generated from s_i, s_i becomes the query result node and, at the same time, the
branching query node since s_i is a schema-level SLCA of all the query keywords; query keywords
that are descendants of s_i become the leaf query nodes. Here, we first obtain the label path lp_i of
s_i by searching the LabelPath table using snode_id(s_i). We then make the query string of xpq_i by
calling the MakeXPathQueryString function with lp_i and the query keywords. In Step 2.1 of the
MakeXPathQueryString function, we do not create a predicate when w_i is the last label of lp. In that
case, w_i is the label of the schema-level SLCA itself; since it is already a part of lp, a predicate for it
is not needed.
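
To make Step 1 concrete, the following sketch computes schema-level SLCAs by brute force. It is explicitly not the GetSLCA algorithm of Xu and Papakonstantinou [46]; it enumerates every combination of postings, which is only reasonable for tiny inputs. The input uses the two three-entry posting lists of Fig. 16, which are consistent with the lists for "XML" and "Levy" in Example 18; which list belongs to which keyword is an assumption, and it does not affect the result.

```python
from itertools import product
from typing import List, Sequence, Tuple

Path = Tuple[int, ...]  # numeric label path of a schema node


def lca(paths: Sequence[Path]) -> Path:
    """Longest common prefix of the given numeric label paths."""
    prefix: List[int] = []
    for ids in zip(*paths):
        if len(set(ids)) != 1:
            break
        prefix.append(ids[0])
    return tuple(prefix)


def naive_slca(posting_lists: Sequence[Sequence[Path]]) -> List[Path]:
    """LCAs of all keyword combinations that have no descendant LCA."""
    lcas = {lca(combo) for combo in product(*posting_lists)}

    def has_descendant(p: Path) -> bool:
        return any(len(p) < len(q) and q[:len(p)] == p for q in lcas)

    return sorted(p for p in lcas if not has_descendant(p))


L1 = [(0, 1, 6, 7), (0, 11, 12, 13), (0, 11, 18)]
L2 = [(0, 1, 3, 5), (0, 1, 6, 8, 10), (0, 11, 12, 14, 15, 17)]

T = naive_slca([L1, L2])          # numeric label paths of the SLCAs
S = [path[-1] for path in T]      # their snode_ids
```

On this input the sketch yields the paths 0.1.6 and 0.11.12, i.e., the snode_ids 6 and 12 that Example 18 reports.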

Example 18 We translate a keyword query "XML Levy" on the XML data in Fig. 2(b) into XPath
queries xpq_1 and xpq_2 in Fig. 19 as follows. In Step 1, we first obtain the posting lists L_1, L_2 of "XML",
"Levy" by searching the schema index in Fig. 16. We then compute the set T of numeric_label_paths
of schema-level SLCAs for L_1 and L_2 by evaluating GetSLCA(L_1, L_2). Here, T = {"0.1.6", "0.11.12"}.
For each sslca_i ∈ T, we add snode_id(sslca_i) to S. Thus, S = {6, 12} in Fig. 4. In Step 2, for the
schema-level SLCA with the snode_id s_1 = 6 ∈ S, we first obtain the label path "bib.conf.paper" of s_1 from
the LabelPath table in Fig. 15. We note that the label_path_id = s_1 = 6. We then create predicates for
"XML" and "Levy". The predicates are "[contains(., "XML")]" and "[contains(., "Levy")]". Finally, we generate

Algorithm 3 Query Translation

Input: (1) a keyword query Q = {w_1, ..., w_n}, (2) the schema index,
       (3) the LabelPath table
Output: the set XPQ of XPath queries
Algorithm:
Step 1. Compute a set S of schema-level SLCAs
  1.1 S := {} /* initialize */
  1.2 Obtain posting lists L_1, ..., L_n of w_1, ..., w_n from the schema index
  1.3 T := GetSLCA(L_1, ..., L_n)
  1.4 For each schema-level SLCA sslca_i ∈ T
    1.4.1 Add snode_id(sslca_i) to S
Step 2. Generate the set XPQ of XPath queries
  2.1 XPQ := {} /* initialize */
  2.2 For each schema-level SLCA s_i ∈ S
    2.2.1 Obtain the label path lp_i of s_i from the LabelPath table
    2.2.2 xpq_i := MakeXPathQueryString(lp_i, w_1, ..., w_n)
    2.2.3 XPQ := XPQ ∪ {xpq_i}
  2.3 Return XPQ

Function MakeXPathQueryString
Input: (1) a label path lp, (2) query keywords w_1, ..., w_n
Output: an XPath query xpq
Step 1. Convert "." in lp into "/"
Step 2. For each query keyword w_i (1 ≤ i ≤ n), create a predicate expr_i
  2.1 If w_i is a label and is not the last label of lp, expr_i := w_i
  2.2 If w_i is a value, expr_i := contains(., "w_i")
Step 3. xpq := lp[expr_1]...[expr_n]
Step 4. Return xpq

Figure 18. The query translation algorithm.

the XPath query xpq_1 by concatenating the label path and the predicates. We similarly generate the
XPath query xpq_2 for the schema-level SLCA with the snode_id s_2 = 12. □
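
The MakeXPathQueryString function of Fig. 18 translates almost line for line into Python. In the sketch below, the `label_keywords` parameter is an assumption: the pseudocode distinguishes labels from values but does not say how, so the caller is asked to name the keywords that are labels. A label predicate is emitted literally as `[w]`, following Step 2.1.

```python
from typing import AbstractSet, Sequence


def make_xpath_query_string(label_path: str,
                            keywords: Sequence[str],
                            label_keywords: AbstractSet[str] = frozenset()) -> str:
    labels = label_path.split(".")
    xpq = "/" + "/".join(labels)              # Step 1: "." becomes "/"
    for w in keywords:                        # Step 2
        if w in label_keywords:
            if w != labels[-1]:               # Step 2.1: skip the SLCA's own label
                xpq += f"[{w}]"
        else:                                 # Step 2.2: value keyword
            xpq += f'[contains(., "{w}")]'
    return xpq                                # Steps 3 and 4


xpq_1 = make_xpath_query_string("bib.conf.paper", ["XML", "Levy"])
xpq_2 = make_xpath_query_string("bib.journal.article", ["XML", "Levy"])
```

For the query "XML Levy", this reproduces the two query strings of Example 18, one for the paper structure and one for the article structure. The sketch does not escape double quotes inside keywords, which a real implementation would need to handle.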



(a) xpq_1: /bib/conf/paper[contains(., "XML")][contains(., "Levy")]

(b) xpq_2: /bib/journal/article[contains(., "XML")][contains(., "Levy")]

Figure 19. The XPath queries generated from "XML Levy".

4.2.2 Query Evaluation

The set of XPath queries obtained in the query translation step can be evaluated with any existing
XPath engine. In this section, we propose an efficient algorithm that simultaneously evaluates the
specific set of XPath queries generated by our method.

In general, there are multiple structures matching the user's query intention, and thus, multiple
XPath queries for those structures are generated from a keyword query. The result of the keyword query
is the union of the results of these XPath queries. As explained in Section 4.2.1, an XPath query xpq_i
generated from a schema-level SLCA s_i has one branching node, i.e., s_i, and the label path of s_i is the
path from the root node to s_i. Query keywords that are descendants of s_i become the leaf query nodes
of xpq_i. The query xpq_i finds the instance nodes that have the label path of s_i and that contain all the
query keywords (this is common to all xpq_i's). We exploit this commonality for efficient simultaneous
computation of multiple queries.
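
The shared shape of the queries (a label path for s_i plus "contains every keyword") suggests a single pass over the keyword postings. The sketch below illustrates that idea only and should not be read as the paper's algorithm: it uses linear scans over in-memory lists where a real system would probe the B+-tree subindexes, and all names are hypothetical. Apart from the "Jagadish" posting of Example 17, the postings are made-up illustration data.

```python
from typing import Dict, List, NamedTuple, Sequence, Tuple

Path = Tuple[int, ...]


class InstancePosting(NamedTuple):
    inode_id: int
    node_path: Path            # instance node ids from the root
    numeric_label_path: Path   # schema node ids from the root


def starts_with(path: Path, prefix: Path) -> bool:
    return path[:len(prefix)] == prefix


def evaluate_all(slca_paths: Sequence[Path],
                 posting_lists: Sequence[List[InstancePosting]]
                 ) -> Dict[Path, List[int]]:
    """For each schema-level SLCA path, ids of instance nodes with that
    label path whose subtree contains a posting of every keyword."""
    results: Dict[Path, List[int]] = {q: [] for q in slca_paths}
    outer, inner = posting_lists[0], posting_lists[1:]
    for p in outer:                            # scan the outer posting list once
        for q in slca_paths:                   # pick the queries p can match
            if not starts_with(p.numeric_label_path, q):
                continue
            root = p.node_path[:len(q)]        # ancestor of p with label path q
            if all(any(starts_with(o.node_path, root) for o in plist)
                   for plist in inner) and root[-1] not in results[q]:
                results[q].append(root[-1])
    return results


XML = [InstancePosting(12, (0, 1, 11, 12), (0, 1, 6, 7)),    # hypothetical
       InstancePosting(22, (0, 1, 21, 22), (0, 1, 6, 7))]    # hypothetical
JAGADISH = [InstancePosting(15, (0, 1, 11, 13, 15), (0, 1, 6, 8, 10))]

answers = evaluate_all([(0, 1, 6), (0, 11, 12)], [XML, JAGADISH])
```

Each posting in the outer list selects, through its numeric label path, the queries it can contribute to, so all queries are answered during one scan instead of one scan per query.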

There has been a lot of work on XPath evaluation, but most of the work focuses on answering
one query at a time. Some research efforts [9, 29, 49] have addressed answering multiple queries
simultaneously, but they are not optimized for the specific set of XPath queries that are generated by
our method. Bruno et al. [9] and Zhang et al. [49] only handle linear XPath queries. Liu et al. [29] handle
XPath queries with branches. Their method is not suitable for the specific set of XPath queries for
the following reasons. They combine multiple queries into a single structure, called a super-twig query,
to exploit query commonalities. They only consider the scenario where query commonalities exist in the
top parts (the parts close to the root node) of multiple original queries. However, in the specific set of
XPath queries, most of the query commonalities exist in the bottom parts of the original queries, which
consist of query keywords. Few query commonalities exist in the top parts since each query has a
unique path from the root node to the branching node. Thus, in the worst case, the cost of their method
is almost the same as that of processing one query at a time. In contrast, our algorithm simultaneously
evaluates all the queries in this specific set by exploiting the query commonalities existing in the bottom
parts of the original queries.






Since the queries in this specific set share the same query keywords that appear in the original keyword query, we can simultaneously evaluate all the queries by joining the posting lists of the query keywords. We obtain the posting lists from the instance index introduced in Section 4.1. Suppose that XPath queries $xpq_1, xpq_2, \ldots, xpq_m$ are obtained from a keyword query $Q = \{w_1, w_2, \ldots, w_n\}$. We perform an index nested-loop join over the posting lists $L_j$ ($1 \le j \le n$) of query keywords $w_j$. For each posting in the outer-most posting list $L_1$, we identify the query to be evaluated from among $xpq_i$ ($1 \le i \le m$). Thus, we simultaneously evaluate different queries while we are scanning $L_1$. As explained in Section 4.1, the posting of an instance node $o$ has the form $\langle inode\_id(o), node\_path(o), numeric\_label\_path(o) \rangle$ where $inode\_id(o)$ is the node id of $o$, $node\_path(o)$ the node path of $o$, and $numeric\_label\_path(o)$ the label path of $o$ that is represented as a sequence of integer ids rather than labels. $node\_path(o)$ contains the ids of the ancestor nodes of $o$ in ascending order, and its last id is $inode\_id(o)$. A posting list is sorted in ascending order of $inode\_id(o)$. Hereafter, we refer to an instance node $o$ by its posting for ease of exposition. For each posting $o_{1a}$ in $L_1$, we find the query to be evaluated using $numeric\_label\_path(o_{1a})$. For $xpq_i$ ($1 \le i \le m$), if the path $p_i$ from the root node to the branching node of $xpq_i$ is a prefix of the label path of $o_{1a}$, $xpq_i$ must be the query that we need to evaluate for $o_{1a}$ since $xpq_i$ finds the instance nodes that have the label path $p_i$ and that contain all the query keywords. Here, $o_{1a}$ matches the query keyword $w_1$ since $o_{1a}$ is a posting of $w_1$. We note that at most one $xpq_i$ is found since each query has a unique branching node. We compute the results only for the postings in $L_1$ that have the corresponding XPath query to be evaluated. Thus, we avoid unnecessary computation of spurious results. We note that, in contrast, the SLCA algorithm [46] computes SLCAs for all postings in $L_1$, incurring unnecessary computation.
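
The query-selection test for a posting of $L_1$ can be sketched as follows. This is a minimal sketch under assumptions: the name `select_query` is hypothetical, each query is represented only by the pair (branching depth, label path id), positions are 1-based as in the text, and the values are taken from the running "XML Levy" example.

```python
def select_query(numeric_label_path, queries):
    """Return the 0-based index of the query to evaluate for a posting, or None.

    numeric_label_path: sequence of label path ids of the posting, root first.
    queries: list of (d_i, label_path_id_i) pairs, one per XPath query.
    The query matches when the d_i-th id (1-based) equals label_path_id_i,
    i.e., when the branching path is a prefix of the posting's label path.
    """
    for i, (depth, label_path_id) in enumerate(queries):
        if len(numeric_label_path) >= depth and numeric_label_path[depth - 1] == label_path_id:
            return i
    return None  # no query applies; the posting is skipped


queries = [(3, 6), (3, 12)]  # (d_1, label_path_id_1), (d_2, label_path_id_2)
print(select_query((0, 1, 6, 8, 10), queries))
print(select_query((0, 11, 12, 14, 15, 17), queries))
print(select_query((0, 11, 18), queries))
```

A posting for which no query matches is skipped entirely, which is how spurious results avoid being computed in the first place.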

We now explain how we evaluate $xpq_i$. Let $d_i$ be the depth of the branching node of $xpq_i$ from the root node, and $node\_path(o_{1a})[d_i]$ be the $d_i$-th id of $node\_path(o_{1a})$. We need to check if the instance node $o$ with the id $node\_path(o_{1a})[d_i]$ contains all the query keywords $w_j$ ($1 \le j \le n$). Here, $o$ corresponds to the query result since the branching node is the query result node in $xpq_i$. $o$ clearly contains $w_1$ since $o$ is an ancestor of $o_{1a}$. $o$ contains $w_j$ ($2 \le j \le n$) if there exists $o_{jb} \in L_j$ for each $L_j$ such that $node\_path(o_{jb})$ and $node\_path(o_{1a})$ have the same prefix from the root node to depth $d_i$. Since we assign a unique preorder id to each node in the XML data tree, $node\_path(o_{jb})$ and $node\_path(o_{1a})$ have the same prefix from the root node to depth $d_i$ if $node\_path(o_{jb})[d_i] = node\_path(o_{1a})[d_i]$. Let $k$ be $node\_path(o_{1a})[d_i]$, which is $inode\_id(o)$. To check the existence of $o_{jb} \in L_j$ such that $node\_path(o_{jb})[d_i] = k$, we utilize the subindex on $L_j$ whose key is the $inode\_id$ of the posting in $L_j$, exploiting Lemmas 4 and 5. Here, we do not need to find all $o_{jb} \in L_j$ such that $node\_path(o_{jb})[d_i] = k$ since we only need to check if $o$ (which corresponds to the query result) contains $w_j$. By Lemmas 4 and 5, to check the existence of $o_{jb} \in L_j$ such that $node\_path(o_{jb})[d_i] = k$, we only need to find a posting $o_{jb}$ such that $inode\_id(o_{jb})$ is the smallest id that is greater than or equal to $k$ in $L_j$ and check whether $node\_path(o_{jb})[d_i] = k$. In summary, we simultaneously evaluate all the queries $xpq_i$ ($1 \le i \le m$) through one scan of $L_1$ and an index nested-loop join over the posting lists $L_j$ ($1 \le j \le n$).

Lemma 4 $inode\_id(o_{jb}) \ge k$ if $node\_path(o_{jb})[d_i] = k$.

Proof: It is straightforward since we assign a preorder id to each node. □

Lemma 5 Let $inode\_id(o_{jb})$ be the smallest id that is greater than or equal to $k$ in $L_j$. If $node\_path(o_{jb})[d_i] \ne k$, then there is no $o_{jb'} \in L_j$ such that $node\_path(o_{jb'})[d_i] = k$.

Proof: Suppose that there exists $o_{jb'} \in L_j$ such that $node\_path(o_{jb'})[d_i] = k$. Then, as we see in Fig. 20, $o_{jb'}$ must be in the subtree rooted at $o(k)$, and $o_{jb}$ must be in the right subtree of $o(k)$. Thus, $inode\_id(o_{jb}) > inode\_id(o_{jb'}) \ge k$. This contradicts the assumption that $inode\_id(o_{jb})$ is the smallest id that is greater than or equal to $k$ in $L_j$. □
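
The probe justified by Lemmas 4 and 5 can be illustrated with a binary search over a sorted posting list. This is a sketch only: the sorted Python list stands in for the subindex on $L_j$, the function name `find_matching_posting` is hypothetical, and postings are simplified to (inode_id, node_path) pairs.

```python
from bisect import bisect_left


def find_matching_posting(postings, depth, k):
    """Return a posting whose node_path has id k at position depth (1-based), or None.

    postings: list of (inode_id, node_path) tuples sorted by inode_id.
    """
    # Lemma 4: any match has inode_id >= k, so jump to the first such posting.
    # The 1-tuple (k,) sorts before every posting whose inode_id is >= k.
    pos = bisect_left(postings, (k,))
    if pos == len(postings):
        return None
    inode_id, node_path = postings[pos]
    if len(node_path) >= depth and node_path[depth - 1] == k:
        return postings[pos]
    # Lemma 5: if the first candidate fails, no later posting can match.
    return None


L_2 = [(7, (0, 1, 6, 7)), (102, (0, 100, 101, 102)), (150, (0, 100, 150))]
print(find_matching_posting(L_2, 3, 6))
print(find_matching_posting(L_2, 3, 101))
print(find_matching_posting(L_2, 3, 8))
```

Only one posting is inspected per probe, so the cost of each probe is dominated by the logarithmic search.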



Figure 20. An example XML data tree for the proof of Lemma 5.

Our algorithm uses the idea of XIR [35] that exploits the schema information (more precisely, the label path) for XPath query processing. XIR decomposes a given XPath query into linear XPath queries. A linear XPath query, which is also known as a linear path expression [35], is an XPath query without branches. It then finds a set of result node paths by processing each linear XPath query, and performs prefix match join between the sets of result node paths. Here, the prefix match join [35] identifies the prefix (a subpath from the root to the branching node) of a node path on one side and finds the matching node paths having the same prefix on the other side of the join. In contrast to XIR, our algorithm simultaneously evaluates multiple XPath queries using the instance index without computing the result node paths a priori for each linear XPath query. In this sense, our algorithm is completely different from XIR.

Fig. 21 shows the query evaluation algorithm, which consists of the following two steps.

In Step 1, we obtain necessary information for query evaluation from the XPath queries. For each XPath query $xpq_i$ ($1 \le i \le m$), we first obtain the depth $d_i$ of the branching node from the root node (simply, the branching depth). We then obtain the id $label\_path\_id_i$ of the label path from the root node to the branching node using the LabelPath table.

In Step 2, we compute the results of the XPath queries. We first obtain the posting lists of the query keywords. We then scan the outer-most posting list $L_1$ and perform an index nested-loop join over the posting lists $L_j$ ($1 \le j \le n$). For each posting $o_{1a} \in L_1$, we find the query $xpq_i$ to be evaluated in Step 2.3.1. If found, we do the inner loop step to check whether the node with the id $node\_path(o_{1a})[d_i]$ contains all the query keywords in Step 2.3.2.1. For each posting list $L_j$ ($2 \le j \le n$), we check the existence of $o_{jb} \in L_j$ such that $node\_path(o_{jb})[d_i] = node\_path(o_{1a})[d_i]$, by calling the FindMatchingPosting function in Step 2.3.2.1.1. The FindMatchingPosting function finds such a posting using the subindex created on the posting list $L_j$ based on Lemmas 4 and 5. If a posting is found for every posting list $L_j$ ($2 \le j \le n$), we return $node\_path(o_{1a})[d_i]$ as the result of $xpq_i$.

Given a set of XPath queries $\{xpq_1, xpq_2, \ldots, xpq_m\}$ having the same query keywords $\{w_1, w_2, \ldots, w_n\}$, the worst case time complexity $C_{XPath}$ of the query evaluation algorithm is $O(|L_1|(m + \sum_{j=2}^{n} \log|L_j|))$ where $L_j$ ($1 \le j \le n$) is the posting list of $w_j$. For each posting in $L_1$, we find the query to be evaluated from among the $m$ queries and one posting from each of the other $n - 1$ posting lists. Finding a posting in $L_j$ using the subindex costs $O(\log|L_j|)$.

Algorithm 4 Query Evaluation

Input: (1) a set of XPath queries $\{xpq_1, \ldots, xpq_m\}$ having the same query keywords $w_1, \ldots, w_n$,
       (2) the LabelPath table, (3) the instance index
Output: the results of the XPath queries
Algorithm:
Step 1. For each XPath query $xpq_i$ ($1 \le i \le m$)
  1.1 $d_i$ := the depth of the branching node from the root node
  1.2 $label\_path\_id_i$ := the id of the label path from the root node to the branching node
Step 2. Perform an index nested-loop join
  2.1 $R$ := {} /* initialize */
  2.2 Obtain the posting lists $L_1, \ldots, L_n$ of $w_1, \ldots, w_n$ from the instance index
  /* outer loop */
  2.3 For each posting $o_{1a} \in L_1$
    /* find $xpq_i$ to be evaluated */
    2.3.1 For $i$ = 1 to $m$, find $xpq_i$ such that $numeric\_label\_path(o_{1a})[d_i] = label\_path\_id_i$
    /* Note that at most one $xpq_i$ is found since each query has a unique branching node */
    2.3.2 If $xpq_i$ is found
      /* inner loop */
      2.3.2.1 For each posting list $L_j$ ($2 \le j \le n$)
        2.3.2.1.1 Check the existence of a posting $o_{jb} \in L_j$ such that
                  $node\_path(o_{jb})[d_i] = node\_path(o_{1a})[d_i]$
                  by calling the function FindMatchingPosting
      2.3.2.2 If a posting is found for every posting list $L_j$ ($2 \le j \le n$)
        2.3.2.2.1 Add $node\_path(o_{1a})[d_i]$ to $R$
  2.4 Return $R$

Function FindMatchingPosting
Input: (1) $d_i$, (2) $node\_path(o_{1a})[d_i]$, (3) $L_j$
Output: a posting $o_{jb}$
Step 1. $k$ := $node\_path(o_{1a})[d_i]$
/* Check the existence of a posting $o_{jb} \in L_j$ such that $node\_path(o_{jb})[d_i] = k$
   using the subindex and exploiting Lemmas 4 and 5 */
Step 2. Find a posting $o_{jb} \in L_j$ such that $inode\_id(o_{jb})$ is the smallest id that is
        greater than or equal to $k$ using the subindex created on $L_j$
Step 3. If $node\_path(o_{jb})[d_i] = k$, return $o_{jb}$

Figure 21. The query evaluation algorithm.

We now compare the performance of our algorithm with that of the instance-level SLCA algorithm [46]. The worst case complexity of the SLCA algorithm is $O(|L_1| d \sum_{j=2}^{n} \log|L_j|)$ [46] where $d$ is
the maximum depth of the XML data. In practice, $d$ of the SLCA algorithm and $m$ of our algorithm are small and do not affect performance significantly. Thus, the "worst case" performance of the two algorithms is almost the same. The critical benefit of our algorithm over the SLCA algorithm is that we avoid unnecessary computation of spurious results by only computing the results of the XPath queries obtained from schema-level SLCAs. This effect comes from the fact that we compute the results only for the postings in $L_1$ that have the corresponding XPath query to be evaluated (in Step 2.3.2) while the SLCA algorithm computes SLCAs for all postings in $L_1$.

Example 19 We evaluate the XPath queries $xpq_1$ and $xpq_2$ in Fig. 19 as follows. In Step 1, the branching depth $d_i = 3$ for $xpq_i$ ($i = 1, 2$). Since, in the LabelPath table in Fig. 15, the id of the label path "bib.conf.paper" is 6 and that of "bib.journal.article" is 12, $label\_path\_id_1 = 6$ and $label\_path\_id_2 = 12$. In Step 2, we first obtain the posting lists $L_1$, $L_2$ of the query keywords "Levy", "XML" as shown in Fig. 22. For the posting $\langle inode\_id(o_{1a}), node\_path(o_{1a}), numeric\_label\_path(o_{1a}) \rangle = \langle 10, 0.1.6.8.10, 0.1.6.8.10 \rangle \in L_1$, $numeric\_label\_path(o_{1a})[d_1] = label\_path\_id_1$, or equivalently, "0.1.6.8.10"[3] = 6. That is, "bib.conf.paper" of $xpq_1$ is a prefix of the label path "bib.conf.paper.author.ln" that corresponds to $numeric\_label\_path(o_{1a})$. Thus, $xpq_1$ is the query to be evaluated, and we do the inner loop step. We find a posting in $L_2$ such that $node\_path(o_{2b})[d_1] = node\_path(o_{1a})[d_1] =$ "0.1.6.8.10"[3] = 6 using the subindex created on $L_2$. Since there is a posting $\langle 7, 0.1.6.7, 0.1.6.7 \rangle \in L_2$ such that "0.1.6.7"[3] = 6, we return 6, which is the node id of paper(6) in Fig.QJb), as the result of $xpq_1$. For the posting $\langle 106, 0.100.101.103.104.106, 0.11.12.14.15.17 \rangle \in L_1$, we can similarly find the result article(101) of $xpq_2$. □

$L_1$ ("Levy"): <10, 0.1.6.8.10, 0.1.6.8.10>, <106, 0.100.101.103.104.106, 0.11.12.14.15.17>

$L_2$ ("XML"): <7, 0.1.6.7, 0.1.6.7>, <102, 0.100.101.102, 0.11.12.13>, <150, 0.100.150, 0.11.18>

Figure 22. An example of Algorithm 4.
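
To make the walk-through concrete, the following Python sketch replays Example 19 on the postings of Fig. 22. It is an illustration under assumptions, not the authors' implementation: in-memory sorted lists stand in for the instance index and its subindexes, each query is reduced to the pair (branching depth, label path id) that Step 1 would produce, and the names `contains_keyword` and `evaluate` are hypothetical.

```python
from bisect import bisect_left


def contains_keyword(posting_list, depth, k):
    """Probe one inner posting list for a posting below node k (Lemmas 4 and 5)."""
    pos = bisect_left(posting_list, (k,))  # first posting with inode_id >= k
    if pos == len(posting_list):
        return False
    node_path = posting_list[pos][1]
    return len(node_path) >= depth and node_path[depth - 1] == k


def evaluate(queries, posting_lists):
    """Return [(query number, result node id)] in the order results are found.

    queries: list of (d_i, label_path_id_i) pairs.
    posting_lists: [L_1, ..., L_n]; each posting is
        (inode_id, node_path, numeric_label_path), sorted by inode_id.
    """
    outer, inner_lists = posting_lists[0], posting_lists[1:]
    results, seen = [], set()
    for _inode_id, node_path, label_path in outer:            # Step 2.3
        for i, (depth, label_path_id) in enumerate(queries):  # Step 2.3.1
            if len(label_path) >= depth and label_path[depth - 1] == label_path_id:
                k = node_path[depth - 1]
                found = all(contains_keyword(L_j, depth, k) for L_j in inner_lists)
                if found and (i, k) not in seen:               # Step 2.3.2.2
                    seen.add((i, k))
                    results.append((i + 1, k))
                break  # at most one query matches a posting
    return results


L_1 = [(10, (0, 1, 6, 8, 10), (0, 1, 6, 8, 10)),
       (106, (0, 100, 101, 103, 104, 106), (0, 11, 12, 14, 15, 17))]
L_2 = [(7, (0, 1, 6, 7), (0, 1, 6, 7)),
       (102, (0, 100, 101, 102), (0, 11, 12, 13)),
       (150, (0, 100, 150), (0, 11, 18))]
print(evaluate([(3, 6), (3, 12)], [L_1, L_2]))
```

The sketch keeps a set of already reported results so that several postings under the same result node yield a single entry, mirroring the use of a result set in Algorithm 4.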

5 Related Work 

There has been a lot of work on keyword search in relational databases [TJ [5] [T71 [TBI ESS H3], which
inspired XML keyword search. However, the work on relational databases is not directly applicable to 
XML since the schema of XML data cannot always be mapped to a rigid relational schema [15] due 
to the semi-structured and heterogeneous nature of XML. Our approach provides novel notions and 
algorithms that are suitable for the semi-structured and heterogeneous nature of XML and eliminates 
spurious results by exploiting the hierarchical nature of XML.

Extensive research has been done on XML keyword search. Under the assumption that smaller subtrees are more relevant to the query, most of the existing methods find the smallest subtrees containing all the query keywords based on the concepts of the LCA or its variants. Schmidt et al. [38] have introduced the notion of the LCA, and Guo et al. [15] have defined a subset of LCAs and proposed an efficient ranking method for the subtrees rooted at the nodes in this set. Xu and Papakonstantinou [47] have studied the properties of LCAs to accelerate the computation. Hristidis et al. [19] have focused on computing the whole subtrees rooted at LCAs. Xu and Papakonstantinou [46] have proposed the concept of the SLCA and presented algorithms for finding SLCAs efficiently. Sun et al. [39] have proposed a method that processes keyword queries involving boolean operators AND and OR under the SLCA semantics. Li et al. [25] have proposed the concept of Meaningful LCA (MLCA), a concept similar to that of SLCA, and incorporated MLCA search in XQuery. Cohen et al. [11] have attempted to find meaningful results based on a heuristic called interconnection relationship, and Li et al. [26] have presented an efficient algorithm for the heuristic.

Liu and Chen [31] have pioneered a novel method for inferring return nodes for XML keyword search. They have proposed a system called XSeek, which infers desirable return nodes by recognizing entities in the XML data. Huang et al. [21] have addressed the important problem of generating effective snippets (i.e., summaries) for XML search results. Liu and Chen [32] have proposed properties to find relevant nodes that match query keywords in the subtree rooted at each SLCA. These schemes for generating return nodes are orthogonal to our method and can be incorporated into it, as we see in Section 6.

Several research efforts [11, 28, 48] have been made to enable users to exploit partial knowledge of the schema in user queries. The query models used in those methods are commonly called labeled keyword search [48], which allows the user to annotate query keywords with labels. For example, in labeled keyword search, "XML Levy" is expressed as "title:XML author:Levy". Using this partial schema information, labeled keyword search can retrieve more meaningful results than simple keyword search that specifies only keywords. The search quality of labeled keyword search relies on the correctness of the labels in a given query [35]. However, a casual user is unlikely to have perfect knowledge of those labels [28]. Our method does not have this problem since it uses the simple keyword search model.

Yu and Jagadish [48] have proposed novel schema-based matching methods for labeled keyword search and Meaningful Summary Query (schema-aware query). These contrast with our framework, which supports schema-free keyword search. They use the schema of XML data to define the matching semantics. In contrast, our method uses the schema to efficiently resolve structural anomaly instead.

Most recently, Bao et al. [5] have proposed a probabilistic framework for inferring user's intention 
and ranking the query results. They compute the confidence level of each candidate node type, which 
is defined as a label path, using the statistics of the underlying XML data and use it to infer the 
user's intention. The method of Bao et al. processes queries at the instance level and additionally uses 
the schema to improve search quality. In contrast, our method, being primarily at the schema level, 
improves not only search quality using the schema but also search performance by processing queries at 
the schema level. 

Besides, there has been extensive work done by W3C to define a full-text extension of XQuery [41], which today has many implementations such as GalaTex [13]. Amer-Yahia et al. [3] have presented efficient evaluation algorithms for full-text XQuery queries, and Pradhan [36] has demonstrated several optimization techniques. In this paper, our focus is to effectively and efficiently support "schema-free" XML keyword search where users only need to specify keywords, as opposed to the full-text extension of XQuery where users must specify structure information as well as keywords according to the XQuery grammar.

There has been a lot of work on ranking schemes [H [Q EJ [15l [17l [HI [20J [27l [30l [42] for keyword search over XML, RDF, or relational databases. The ranking schemes and the concept of structural consistency can complement each other to help users find relevant results. For example, enforcing structural consistency could be too restrictive for certain applications, i.e., some query results eliminated by structural consistency may be relevant to the query. In this case, we can exploit structural consistency as one of the ranking criteria that measures the meaningfulness [48] of the results rather than as a criterion for removing spurious results, as has similarly been suggested by Yu and Jagadish [48].






6 Experimental Evaluation 



6.1 Experimental Setup 

The goal of the experiments is to verify the advantage of our method in terms of search quality and search performance. As for search quality, we compare our method with SLCA [46] and MLCA [28] as they are the state-of-the-art methods; we exclude XSEarch [11] from the comparison since Li et al. [28] have shown that MLCA is generally superior to XSEarch. As for search performance, we compare our method with SLCA, excluding MLCA from the comparison, since Xu and Papakonstantinou [46] have shown that the SLCA searching algorithm generally shows superior performance over the MLCA searching algorithm. In addition, we compare the index creation time and index size of our method with those of the SLCA method to show that an extra schema index for efficient structural consistency checking incurs negligible overhead to overall system performance. We use precision and recall as the measures for search quality. Following the common practice [11, 26, 28], we define the desired results of a keyword query as those returned by structured queries (XPath queries) corresponding to the keyword query, which are formulated by the users who participated in the experiments. We use the wall clock time as the measure for search performance and index creation, and the number of pages allocated as the measure for the index size.
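
As a reminder of how the two quality measures are computed from sets of retrieved and desired nodes, here is a small Python sketch. The function name `precision_recall` and the convention of returning 0.0 for an empty set are assumptions made for illustration; the node ids in the usage lines are invented.

```python
def precision_recall(retrieved, desired):
    """Compute (precision, recall) over sets of node ids.

    retrieved: node ids returned by a keyword search method.
    desired:   node ids returned by the corresponding structured (XPath) query.
    Empty sets yield 0.0 by convention (an assumption of this sketch).
    """
    retrieved, desired = set(retrieved), set(desired)
    hits = len(retrieved & desired)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(desired) if desired else 0.0
    return precision, recall


# Hypothetical node ids: 4 retrieved nodes, 3 desired nodes, 2 in common.
p, r = precision_recall({1, 2, 3, 4}, {2, 4, 5})
print(p, r)
```

Counting is done per node, so a method that returns a large spurious subtree lowers precision in proportion to the number of nodes in that subtree.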

Independent of the query processing method, we need to specify which output (i.e., return node) generation strategy [31] to use: Subtree Return, Path Return, Subtree-Entity Return, or Path-Entity Return. Subtree Return outputs the whole subtree rooted at each query result. Path Return outputs the paths from the root of each query result to the query keywords. Subtree-Entity Return and Path-Entity Return first find the lowest entity ancestor-or-self node of each query result, and then output the subtree rooted at the node and the paths from the node to the query keywords, respectively. In the same way as was done by Liu and Chen [31], if a node with label $l_1$ has a one-to-many relationship with nodes with label $l_2$, we consider the nodes with label $l_2$ as entities. According to Liu and Chen [31], Path Return usually has higher precision but lower recall than Subtree Return since it returns only paths. The strategies with entities generally have higher precision and recall than the ones without entities.

We present experimental results using the output strategies with entities since these strategies show superior search quality over those without. We note that this superiority has also been verified in all the experiments we performed. Thus, we omit experimental results for the output strategies without entities. For complete experimental results including other output strategies, please refer to our technical report [53]. Hereafter, "SC" denotes our method; "S-E" a method with Subtree-Entity Return; and "P-E" a method with Path-Entity Return. For example, SC-S-E denotes our method with Subtree-Entity Return.

We have performed experiments using three real data sets and one synthetic data set. The first one is the DBLP data set [34]. We use the same schema used in the experiments by Xu and Papakonstantinou [46], which groups the DBLP data set first by journal/conference names, and then, by years. The second one is the SIGMOD Record data set [34]. The third one is the NASA data set [34], which consists of astronomical data. It has a complex and recursive schema and allows a wider variety of queries than the DBLP and SIGMOD Record data sets. The fourth and synthetic one is the XMark benchmark data set available at the XMark web site [45]. These data sets have been extensively used in the existing work on XML keyword search [IH [TEl H3 I2HI Ell EHl ESI SHI HI]. Table 2 shows statistics of these data sets. We see that the size of the schema is significantly smaller than that of the XML data.

Table 2. Data statistics. 



data set        size         # of instance nodes    # of distinct   # of schema nodes   average
                             (excl. value nodes)    keywords        (excl. keywords)    depth
SIGMOD Record   0.5 MBytes   15,263                 5,652           12                  5
DBLP            127 MBytes   3,736,406              572,062         145                 3
NASA            23 MBytes    530,528                48,430          110                 6
XMark           111 MBytes   2,048,193              127,905         548                 5



Experiment 1: To compare search performance and analyze the relationship between search performance and precision/recall in a controlled setting, we have generated the queries in Table 3 for the DBLP, NASA, and XMark data sets.⁷ To show the cases where our method has low precision or recall, which are rare, we add the following queries: $QD_6$, $QD_7$, $QX_6$, $QX_7$, $QN_4 \sim QN_7$. We also include



$QD_8$, $QX_8$, $QN_8$ to test the case where users specify very long queries containing 9 ~ 13 keywords. We run each query in Table 3 ten times and measure precision, recall, and the average wall clock time. Since how the underlying XML data are stored highly affects the query result construction time, which is not our focus, we only access the root node $r$ of each query result and report the number of the descendant nodes of $r$ for the Subtree-Entity Return when measuring the wall clock time of query performance.

7 For the XMark data set, the XMark benchmark queries are not used since the queries are expressed in XQuery and have complex semantics such as path expressions, join, aggregation, grouping, and ordering. Since keyword queries have inherently limited expressive power, it is not feasible to rewrite all the benchmark queries into keyword queries. For some queries that do not have complex semantics and can easily be converted into keyword queries, e.g., $QX_4$ and $QX_7$, we exploit them.



Table 3. Query sets. 



ID       Query

DBLP data set
$QD_1$   "flexibility"
$QD_2$   "scheduling management"
$QD_3$   "quality analysis data"
$QD_4$   "rule programming object system"
$QD_5$   "Levy J Jagadish H"
$QD_6$   "flexibility message scheme"
$QD_7$   "ICDE XML Jagadish"
$QD_8$   "distributed data base systems performance analysis Michael Stonebraker John Woodfill"

NASA data set
$QN_1$   "astroObjects"
$QN_2$   "Michael magnitude"
$QN_3$   "photometry galactic cluster Astron"
$QN_4$   "pleiades dataset"
$QN_5$   "PAZh components"
$QN_6$   "pleiades journal"
$QN_7$   "textFile name"
$QN_8$   "accurate positions of 502 stars Eichhorn Googe Murphy Lukac"

XMark data set
$QX_1$   "Zurich"
$QX_2$   "Arizona Mehrdad edu"
$QX_3$   "Takano sun com mailto"
$QX_4$   "homepage name"
$QX_5$   "Helena 96"
$QX_6$   "mehrdad takano net"
$QX_7$   "person id person0 name"
$QX_8$   "harpreet mahony nodak edu 99 lazaro st el svalbard and jan mayen island"



Experiment 2: To compare search performance for a real set of user queries, we have obtained two hundred queries⁸ for each of the real data sets, i.e., the DBLP, SIGMOD Record, and NASA data sets (a total of six hundred queries), from ten graduate students majoring in databases (but not involved in this project). We measure the wall clock time for all the queries.

8 For the list of queries, please refer to http://dblab.kaist.ac.kr/-drlee/sc.html

Experiment 3: To show the superiority of the query evaluation algorithm presented in Section 4.2.2, we compare search performance of our method that uses the algorithm and the one that uses XIR [35], which does not process multiple XPath queries simultaneously. We measure the wall clock time for the six hundred queries used in Experiment 2.

Experiment 4: To compare search quality for real sets of user queries, we measure precision and recall 
for the six hundred queries used in Experiment 2. 

Experiment 5: To compare the index creation time⁹ and index size, we measure the wall clock time and the number of pages allocated.

Experiment 6: To test the scalability of our method, we generate XMark data sets by varying the size from 1 GBytes to 4 GBytes and from 100 MBytes to 10 GBytes. We measure the wall clock time for queries $QX_2$, $QX_3$, $QX_4$, and $QX_5$.

All the experiments are conducted on a SUN Ultra 60 workstation with an UltraSPARC-II 450MHz CPU and 512 MBytes of main memory. We implement all the methods on the ODYSSEUS ORDBMS gj, which supports the inverted index. The page size for data and indexes is set to be 4096 bytes. We use the Indexed Lookup Eager algorithm [46] as the SLCA searching algorithm since it generally shows superior performance over other algorithms. Finally, all the methods are implemented using C++.

6.2 Experimental Results 

Experiment 1: Fig. 23 shows the precision, recall, and wall clock time for the queries $QD_1 \sim QD_8$ in Table 3 over the DBLP data set. SC-S-E (SC-P-E) improves the query performance by up to 2.4 times (2.5 times) over SLCA-S-E (SLCA-P-E). The reason for the improvement is that our method eliminates spurious results early by enforcing structural consistency at the schema level. We note that the recall values of our method and SLCA are the same. The improvement becomes more marked when the precision of SLCA is low, i.e., when the number of spurious results is high. For example, in Fig. 23(a), the precision of SLCA for $QD_4$ is lower than that for $QD_3$, and thus, in Fig. 23(c), the query processing time for $QD_4$ is higher than that for $QD_3$, while those of our method are hardly changed.



However, if the precision of SLCA is high, i.e., when there are few spurious results, for a specific query, our method could be marginally slower than SLCA due to the overhead of XPath query evaluation and iterative $k$th-ancestor generalization. For example, in Fig. 23(c), our method is about 10% slower than SLCA for $QD_1$ and $QD_5$.

9 In the index creation time, the time for XML document parsing, keyword extraction, and data loading is excluded.

Figure 23. Precision, recall, and wall clock time of queries in Table 3 for the DBLP data set.

In Fig. 23(a), our method shows low precision for $QD_6$ and $QD_7$. For $QD_6$, there is a conference paper on "flexibility message scheme" in the database, but no journal article. In this case, our method finds spurious journal nodes through generalization, resulting in low precision. For $QD_7$, the user wants to find "ICDE" papers about "XML" authored by "Jagadish", but our method and SLCA return the whole subtree rooted at the "ICDE" conference node (or the paths from the conference node to the query keywords), resulting in the same low precision. Even for such queries, the precision of our method is higher than or equal to that of SLCA since our method is able to eliminate more spurious results than SLCA. For example, for $QD_6$, our method does not find spurious conf nodes since there is a paper on "flexibility message scheme", but SLCA does.

The reason why the SLCA method often has very low precision is that it often finds more spurious SLCA nodes than correct ones. For example, there are only five publications of "Levy" on "XML" in the DBLP data set, but the SLCA method finds 50 SLCAs for the query "XML Levy", 45 of which are spurious conf nodes. Furthermore, conf nodes typically include huge subtrees having thousands of nodes. Thus, the number of retrieved nodes that are spurious becomes very large, leading to very low precision. The Subtree-Entity Return (S-E) has even lower precision because this strategy returns the whole subtree rooted at each query result, and the number of all nodes in the subtree is counted as the number of retrieved nodes.

Fig. 24 shows the precision, recall, and wall clock time for the NASA data set, which has a tendency similar to that of the DBLP data set except for $QN_4$ and $QN_5$.

Figure 24. Precision, recall, and wall clock time of queries in Table 3 for the NASA data set.

For $QN_4$, the recall of our method, SC-S-E and SC-P-E, is almost 0 (both $1.3 \times 10^{-4}$ since they find the same para nodes). This is because the user intends to find more general results, which we regard as spurious results. For $QN_4$, "pleiades dataset", the user wants to find the subtrees rooted at dataset nodes that contain the keyword "pleiades". However, our method finds only the para nodes (i.e., paragraphs) that are contained in the subtrees rooted at the dataset nodes. Thus, we have very low recall. In contrast, the SLCA method finds (1) the para nodes and (2) the dataset nodes that do not have para nodes containing the keywords "pleiades" and "dataset". (We note that the recall value of SLCA-S-E for $QN_4$ looks perfect in Fig. 24(b), but it is not 1.0 since the SLCA method also finds the para nodes as our method does.) We can solve this low-recall problem using relevance feedback. The result is shown in Fig. 25. By using relevance feedback, we can generalize the para nodes to the dataset nodes and obtain the desired results.

(a) Precision. (b) Recall. (c) Wall clock time.

Figure 25. Precision, recall, and wall clock time of $QN_4$ with relevance feedback.



For QN5, the precision and recall of our method are both 0, constituting the worst case of our 
method. For QN5, "PAZh components", the user wants to find the subtrees rooted at the dataset nodes that 
(1) have altname nodes whose value is "PAZh" and (2) contain the keyword "components". However, our 
method finds holding nodes since there are holding nodes that contain the keywords "PAZh" and "components". 
In contrast, existing methods find (1) the holding nodes and (2) the desired dataset nodes. We can also 
solve this problem by generalizing the holding nodes to the dataset nodes. The result is shown in Fig. [26]. 
In Fig. [26](a), the precision of our method is worse than that of existing methods because we find spurious 
results during generalization, as explained in Example [14] of Section [3.4] (see footnote 10), while existing 
methods do not. That is, our method finds the dataset nodes that contain "PAZh" and "components" where the 
altname of the dataset node is not "PAZh". 
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(a) Precision. (b) Recall. (c) Wall clock time. 

Figure 26. Precision, recall, and wall clock time of QN5 with relevance feedback. 

Fig. [27] shows the precision, recall, and wall clock time for the XMark data set, showing a similar 
tendency to those of the DBLP and NASA data sets. Similar to QN5 in the NASA data set, QX5 
constitutes the worst case of our method. Fig. [28] shows the results of QX5 with relevance feedback. 
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Figure 27. Precision, recall, and wall clock time of queries in Table [3] for the XMark data set. 



Figure 28. Precision, recall, and wall clock time of QX5 with relevance feedback. 

10 In Example [14], conf_year nodes correspond to dataset nodes; chair to altname; "Levy" to "PAZh"; "XML" to 
"components"; paper to holding. 

Experiment 2: Fig. [29] shows the search performance results for a real set of user queries. The Y-axis 
represents the fraction of queries for which our algorithm has a given range of performance improvement 
over the SLCA algorithm. The performance improvement is defined as the ratio of the wall clock time 
$T_{\text{SLCA-S-E}}$ of SLCA to the wall clock time $T_{\text{SC-S-E}}$ of SC and is denoted as x. In Fig. [29], 
"-U" denotes our method with relevance feedback. For the NASA data set in Fig. [29](c), SC-S-E (SC-S-E-U) 
outperforms SLCA-S-E by more than 10% for 69% (66%) of queries. In contrast, SLCA-S-E outperforms SC-S-E 
(SC-S-E-U) for only 10% (12%) of queries. Figs. [29](a) and (b) show a tendency similar to that of the 
NASA data set. We omit the results for the Path-Entity Return (P-E) since they show a tendency 
similar to those of the Subtree-Entity Return (S-E). 
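
The per-query improvement ratio and the fractions reported for Experiment 2 can be computed along the following lines. This is a minimal sketch, assuming the wall clock times of the two methods are available as parallel lists; the function names and the sample timings are hypothetical (not measured data), and "outperforms by more than 10%" is interpreted here as a ratio above 1.1.

```python
def improvement_ratios(t_slca, t_sc):
    """Return x = T_SLCA / T_SC for each query (x > 1 means SC is faster)."""
    if len(t_slca) != len(t_sc):
        raise ValueError("timing lists must have the same length")
    return [a / b for a, b in zip(t_slca, t_sc)]


def fraction_where(ratios, predicate):
    """Fraction of queries whose ratio satisfies the predicate."""
    return sum(1 for x in ratios if predicate(x)) / len(ratios)


# Hypothetical timings in seconds for five queries.
t_slca = [2.0, 1.5, 0.9, 3.0, 1.0]
t_sc = [1.0, 1.0, 1.0, 2.0, 1.0]

ratios = improvement_ratios(t_slca, t_sc)
sc_wins = fraction_where(ratios, lambda x: x > 1.1)    # SC faster by more than 10%
slca_wins = fraction_where(ratios, lambda x: x < 1.0)  # SLCA faster
```

Reporting fractions of queries per ratio range, rather than a single average, keeps a few very slow queries from dominating the comparison.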
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Figure 29. The search performance results of six hundred queries for the DBLP, SIGMOD Record, and 
NASA data sets. The Y-axis represents the fraction of queries for which our algorithm has a given range 
of performance improvement over the SLCA algorithm. 

Experiment 3: Our method that uses the algorithm presented in Section [4.2.2] outperforms the one 
that uses XIR [35] by 1.8 ~ 5.2 times since the algorithm simultaneously evaluates multiple XPath 
queries while XIR evaluates one query at a time. 

Experiment 4: Figs. [30] and [31] show the precision (denoted as p) and the recall (denoted as r) of two 
hundred queries over the DBLP data set and the SIGMOD Record data set, respectively. The Y-axis of 
the figures represents the fraction of queries having a given precision/recall range. MLCA and SLCA 
often find more spurious nodes than correct ones. For example, for the query "activity recognition", they 
find 130 results, 122 of which are spurious conf or journal nodes. Thus, for the DBLP data set, the 
precision of SLCA and MLCA is less than 0.5 for 46% ~ 87% of queries. For the SIGMOD Record data 
set, their precision is less than 0.5 for 23% ~ 59% of queries. In contrast, the precision of our method 
is 1.0 for all queries since it eliminates all the spurious results by enforcing structural consistency. We 
note that the recall values of our method, MLCA, and SLCA are the same. These results are similar to 
those of Experiment 1. 
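
As a rough worked illustration of the "activity recognition" example, suppose each returned result is counted as one unit (a simplifying assumption; the evaluation itself counts the nodes in each result subtree) and the 8 non-spurious results are all correct. The precision of SLCA and MLCA for that query would then be approximately:

```latex
\[
p = \frac{130 - 122}{130} = \frac{8}{130} \approx 0.06 < 0.5
\]
```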
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Figure 30. Precision and recall of two hundred queries for the DBLP data set. The Y-axis represents 
the fraction of queries having a given precision/recall range. 

In Fig. [31](b), SC-S-E, MLCA-S-E, and SLCA-S-E show low recall for about 16% of queries. In this 
case, the users want the articles of an author, e.g., "Jennifer Widom", but all methods return the author in 
the articles since the author is the lowest entity containing all the query keywords. However, SC-S-E-U 
shows perfect recall since it finds the articles of an author by using relevance feedback. The average 
number of relevance feedbacks provided by the users for the 200 queries on the SIGMOD Record data 
set is 0.36/query. 
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Figure 31. Precision and recall of two hundred queries for the SIGMOD Record data set. The Y-axis 
represents the fraction of queries having a given precision/recall range. 



Fig. [32] shows the precision and the recall of two hundred queries over the NASA data set. The 
precision of SLCA and MLCA is less than 0.5 for 35% ~ 56% of queries. In contrast, the precision 
of our method is less than 0.5 for only 9% ~ 10% of queries. Here, our method shows low precision 
for some queries due to the complex schema of the NASA data set. For example, for the query "radio 
journal" , the user wants to find journal articles on "radio" . Our method finds not only correct results but 
also spurious results such as revision nodes, as SLCA and MLCA do, since there are revision nodes that 
contain the keywords "radio" and "journal". 
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Figure 32. Precision and recall of two hundred queries for the NASA data set. The Y-axis represents 
the fraction of queries having a given precision/recall range. 

In Fig. [32](b), for about 9% of queries, the recall values of our method without relevance feedback 
are lower than those of SLCA and MLCA for the same reason as in Example [14] of Section [3.4]. 
However, by using relevance feedback, we can achieve higher recall values than SLCA and MLCA. 
The average number of relevance feedbacks provided by the users for the 200 queries on the NASA data 
set is 0.30/query. 

Experiment 5: Fig. [33] shows the index creation time and the index size. All methods use an inverted 
index for XML data and the Dewey index [31] to find the lowest entity ancestor of each query result. 
SC-S-E and SC-P-E additionally use the schema index for efficient structural consistency checking. Thus, 
the index creation time of SC-S-E and SC-P-E is about 5% ~ 7% longer, and the index size is about 5% 
~ 7% larger, than those of SLCA-S-E and SLCA-P-E. This verifies that the extra schema index incurs 
negligible overhead on overall system performance. We note that the index is bigger than the original 
data due to the space required for storing id paths from the root to each node. SLCA-based methods 
have similar space overhead since they also use id paths, i.e., Dewey numbers. We could reduce the 
space by exploiting the UTF-8 encoding as an efficient way to represent id paths, which was proposed 
by Tatarinov et al. [40]. 
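
To make the idea of a UTF-8 style id path concrete, the following minimal sketch encodes each Dewey component as one code point using Python's built-in UTF-8 codec. It only illustrates the general principle (small components take fewer bytes, and byte order matches document order); the function names are hypothetical, components are limited to the range 0..0x10FFFF by the codec, and the actual scheme of Tatarinov et al. may differ in its details.

```python
def encode_dewey(path):
    """Encode a Dewey id (sequence of ints in 0..0x10FFFF) as UTF-8 bytes.

    Each component is treated as a code point, so small components take one
    byte and larger ones take up to four. 'surrogatepass' lets the codec
    accept the range 0xD800..0xDFFF, which strict UTF-8 would reject.
    """
    return b"".join(chr(c).encode("utf-8", "surrogatepass") for c in path)


def decode_dewey(data):
    """Inverse of encode_dewey: recover the list of components."""
    return [ord(ch) for ch in data.decode("utf-8", "surrogatepass")]


def is_ancestor(a, b):
    """True if encoded id `a` is a proper ancestor of encoded id `b`."""
    return len(a) < len(b) and b.startswith(a)


# Byte-wise sorting of the encoded ids reproduces document order.
paths = [[1, 200], [1, 2, 5], [1, 2], [1, 1000], [2]]
encoded_sorted = sorted(encode_dewey(p) for p in paths)
decoded_sorted = [decode_dewey(e) for e in encoded_sorted]
```

Because UTF-8 is a prefix-free code whose byte order follows code point order, ancestor tests and document-order comparisons can be done directly on the encoded bytes without decoding.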
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Figure 33. Index creation time and index size for the DBLP and XMark data sets. 

Experiment 6: Figs. [34] and [35] show the processing time of queries QX2, QX3, QX4, and QX5 as the 
data set size is varied from 1 GBytes to 4 GBytes and from 100 MBytes to 10 GBytes, respectively. As we 
can see, the processing time of all methods increases approximately linearly as the data set size 
increases, and our methods are largely superior or comparable to SLCA-based methods. 
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Figure 34. Query processing time with increasing data set size from 1 GBytes to 4 GBytes in a linear 
scale. 
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Figure 35. Query processing time with increasing data set size from 100 MBytes to 10 GBytes in a 
logarithmic scale. 
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7 Conclusions 

We have proposed a new notion of structural consistency (and structural anomaly) in XML keyword 
search. By exploiting structural consistency, we can eliminate spurious results having the same result 
structure consistently. We have introduced the concept of the result structure in Definition [3] and the 
smallest result structure in Definition [6]. We have formally defined the structural anomaly in Definition [5] 
as a phenomenon where there exist result structures that structurally contain other result structures. 
We have defined the structural consistency as a property where there is no structural anomaly in the 
query results. 
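
The following minimal sketch illustrates how a structural anomaly could be resolved at the instance level. It uses the proper-prefix relation between incoming label paths as a stand-in for structural containment, which is a simplifying assumption: the formal definitions of result structures are more detailed. The function names and the sample label paths are hypothetical.

```python
def is_proper_prefix(p, q):
    """True if label path p is a proper prefix of label path q."""
    return len(p) < len(q) and q[:len(p)] == p


def enforce_structural_consistency(results):
    """Drop every result whose label path is a proper prefix of the label
    path of some other result, keeping only the more specific results.

    `results` is a list of (node_id, label_path) pairs with tuple paths.
    """
    paths = {path for _, path in results}
    spurious = {p for p in paths if any(is_proper_prefix(p, q) for q in paths)}
    return [(node, path) for node, path in results if path not in spurious]


# Hypothetical results: two dataset nodes and one para node.
dataset = ("datasets", "dataset")
para = ("datasets", "dataset", "descriptions", "description", "para")
results = [("d1", dataset), ("p1", para), ("d2", dataset)]

consistent = enforce_structural_consistency(results)
```

Note that the filter works on label paths rather than on individual nodes, so "d2" is removed even though it has no para descendant among the results; this is the sense in which results with the same structure are eliminated consistently.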

We have proposed a naive algorithm that resolves structural anomaly at the instance level. We 
have then proposed an advanced algorithm that resolves structural anomaly at the schema level. To 
this end, we have formally analyzed the relationship between the set of schema-level SLCAs and the 
set of instance-level SLCAs in Lemmas [2] ~ [3], identified the discrepancies between them, and proposed 
the notion of iterative kth-ancestor generalization to resolve the anomalies (false dismissal and phantom 
schema structures) that are caused by these discrepancies. We have formally proved that the proposed 
algorithms produce the same set of results preserving structural consistency in Theorem [1]. We have 
proposed a solution using relevance feedback for the problem where our method has low recall; this 
problem occurs when it is not the user's intention to find more specific results. We have provided an 
efficient algorithm that simultaneously evaluates multiple XPath queries generated by our method. We 
have implemented our method in a full-fledged object-relational DBMS. 

We have performed extensive experiments using real and synthetic data sets. Experimental results 
show that our method improves precision significantly compared with the existing methods while 
providing comparable recall for most queries. Experimental results also show that our method improves 
the query performance over the existing methods significantly by removing spurious results early. 
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Appendix A. Proof of Lemma [2] 

Let $\{w_1, w_2, \ldots, w_n\}$ be the set of query keywords of $Q$, and $l_1.l_2.\cdots.l_m$ be the incoming label 
path of $srs_i$. We need to show that there always exists a schema-level SLCA $s$ such that $l_1.l_2.\cdots.l_m$ is a 
prefix of the label path of $s$. Since $srs_i$ is a smallest result structure of instance-level SLCAs, there exists 
an instance node $v$ such that $l_1.l_2.\cdots.l_m$ is the label path of $v$, and $w_1, w_2, \ldots, w_n$ are descendants 
of $v$. It follows that there exists a schema node $s_a$ such that $l_1.l_2.\cdots.l_m$ is the label path of $s_a$ and 
$w_1, w_2, \ldots, w_n$ are descendants of $s_a$ (i.e., $srs_i = ss(s_a)$) since the DataGuide$^+$ has every unique label 
path of instance nodes. Thus, by the definition of schema-level SLCA, there exists a schema-level SLCA $s$ such 
that $ss(s_a) \preceq ss(s)$. □ 

Appendix B. Proof of Lemma 1 

Let $ILP(srs_i)$ be the incoming label path of $srs_i$, and $ILP(ss_j)$ be the incoming label path of $ss_j$. Since 
$srs_i \prec ss_j$, $ILP(srs_i)$ is a proper prefix of $ILP(ss_j)$. This implies that there must exist a kth-ancestor 
$s_a$ ($1 \le k < depth(s)$) of the schema-level SLCA $s$ whose label path is the same as $ILP(srs_i)$. Here, 
$ss(s_a) = srs_i$ since the label path of $s_a$ is the same as $ILP(srs_i)$. □ 
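
The correspondence between proper prefixes of a label path and kth-ancestors used in this argument can be checked with a small sketch. It assumes that label paths are tuples of labels and that the depth of a schema node equals the length of its label path (so the root has depth 1); the function names and the sample path are hypothetical.

```python
def kth_ancestor_path(label_path, k):
    """Label path of the kth ancestor of a schema node (k = 1 is the parent)."""
    if not 1 <= k < len(label_path):
        raise ValueError("k must satisfy 1 <= k < depth")
    return label_path[:len(label_path) - k]


def ancestor_for_prefix(label_path, prefix):
    """Return k such that the kth ancestor has label path `prefix`, or None
    if `prefix` is not a non-empty proper prefix of `label_path`."""
    m, n = len(prefix), len(label_path)
    if 0 < m < n and label_path[:m] == prefix:
        return n - m
    return None


# Hypothetical label path of a schema-level SLCA and a shorter result structure.
slca_path = ("dblp", "conf", "conf_year", "paper", "author")
srs_path = ("dblp", "conf", "conf_year")

k = ancestor_for_prefix(slca_path, srs_path)
```

Every non-empty proper prefix of length m corresponds to exactly one ancestor, namely the one with k equal to the path length minus m, which is why generalizing a schema-level SLCA step by step eventually reaches the schema node for the shorter label path.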
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